diff --git a/.gitignore b/.gitignore index 73d45c00e12351fc44b4a6d8f6eabb6ac0276c6b..857dc08d50177fc1495321a075a13f72ff998b56 100644 --- a/.gitignore +++ b/.gitignore @@ -21,6 +21,7 @@ tmp_dir/ weights/ pretrained_weights/ model_weights/ +prepared_data/ # Unit Tests/Coverage .coverage diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 0000000000000000000000000000000000000000..224f7c56df9d577eb418fb7e8facd31c89bdd0ef --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,34 @@ +cff-version: "1.2.0" +authors: +- family-names: Zaurin + given-names: Javier Rodriguez + orcid: "https://orcid.org/0000-0002-1082-1107" +- family-names: Mulinka + given-names: Pavol + orcid: "https://orcid.org/0000-0002-9394-8794" +doi: 10.5281/zenodo.7908172 +message: If you use this software, please cite our article in the + Journal of Open Source Software. +preferred-citation: + authors: + - family-names: Zaurin + given-names: Javier Rodriguez + orcid: "https://orcid.org/0000-0002-1082-1107" + - family-names: Mulinka + given-names: Pavol + orcid: "https://orcid.org/0000-0002-9394-8794" + date-published: 2023-06-24 + doi: 10.21105/joss.05027 + issn: 2475-9066 + issue: 86 + journal: Journal of Open Source Software + publisher: + name: Open Journals + start: 5027 + title: "pytorch-widedeep: A flexible package for multimodal deep + learning" + type: article + url: "https://joss.theoj.org/papers/10.21105/joss.05027" + volume: 8 +title: "pytorch-widedeep: A flexible package for multimodal deep + learning" \ No newline at end of file diff --git a/README.md b/README.md index 8ebf2e15c210a445d9f39b3e134315b38255984d..fd91b45092d371c6d5f1e7d6b129c8b657671764 100644 --- a/README.md +++ b/README.md @@ -12,6 +12,7 @@ [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/jrzaurin/pytorch-widedeep/graphs/commit-activity) [![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/jrzaurin/pytorch-widedeep/issues) [![Slack](https://img.shields.io/badge/slack-chat-green.svg?logo=slack)](https://join.slack.com/t/pytorch-widedeep/shared_invite/zt-soss7stf-iXpVuLeKZz8lGTnxxtHtTw) +[![DOI](https://joss.theoj.org/papers/10.21105/joss.05027/status.svg)](https://doi.org/10.21105/joss.05027) # pytorch-widedeep @@ -38,6 +39,9 @@ The content of this document is organized as follows: - [How to Contribute](#how-to-contribute) - [Acknowledgments](#acknowledgments) - [License](#license) + - [Cite](#cite) + - [BibTex](#bibtex) + - [APA](#apa) ### Introduction @@ -82,7 +86,7 @@ without a ``deephead`` component can be formulated as: Where σ is the sigmoid function, *'W'* are the weight matrices applied to the wide model and to the final -activations of the deep models, *'a'* are these final activations, +activations of the deep models, *'a'* are these final activations, φ(x) are the cross product transformations of the original features *'x'*, and , and *'b'* is the bias term. In case you are wondering what are *"cross product transformations"*, here is @@ -331,4 +335,31 @@ Vision](https://www.pyimagesearch.com/deep-learning-computer-vision-python-book/ This work is dual-licensed under Apache 2.0 and MIT (or any later version). You can choose between one of them if you use this work. 
-`SPDX-License-Identifier: Apache-2.0 AND MIT` \ No newline at end of file +`SPDX-License-Identifier: Apache-2.0 AND MIT` + +### Cite + +#### BibTex + +``` +@article{Zaurin_pytorch-widedeep_A_flexible_2023, +author = {Zaurin, Javier Rodriguez and Mulinka, Pavol}, +doi = {10.21105/joss.05027}, +journal = {Journal of Open Source Software}, +month = jun, +number = {86}, +pages = {5027}, +title = {{pytorch-widedeep: A flexible package for multimodal deep learning}}, +url = {https://joss.theoj.org/papers/10.21105/joss.05027}, +volume = {8}, +year = {2023} +} +``` + +#### APA + +``` +Zaurin, J. R., & Mulinka, P. (2023). pytorch-widedeep: A flexible package for +multimodal deep learning. Journal of Open Source Software, 8(86), 5027. +https://doi.org/10.21105/joss.05027 +``` diff --git a/VERSION b/VERSION index f0bb29e76388856b273698ae6064b0380ce5e5d2..3a3cd8cc8b079cb410a465d2925b9cbd703115cb 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -1.3.0 +1.3.1 diff --git a/examples/notebooks/19_wide_and_deep_for_recsys_pt1.ipynb b/examples/notebooks/19_wide_and_deep_for_recsys_pt1.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..b6a57e6dcd07dd90e3b4142246e89a57026e476b --- /dev/null +++ b/examples/notebooks/19_wide_and_deep_for_recsys_pt1.ipynb @@ -0,0 +1,2303 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d298e185", + "metadata": {}, + "source": [ + "The goal of this notebook and its companion (part 2) is to illustrate how one could use this library in the context of recommendation systems. In particular, this notebook and the scripts at the `wide_deep_for_recsys` dir are a response to this [issue](https://github.com/jrzaurin/pytorch-widedeep/issues/133). Therefore, we will use the [Kaggle notebook](https://www.kaggle.com/code/matanivanov/wide-deep-learning-for-recsys-with-pytorch) referred to in that issue.\n", + "\n", + "In order to keep the length of the notebook tractable, we will split this exercise into two. In this first notebook we will prepare the [data](https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset) in almost exactly the same way as it is done in the Kaggle notebook and also show how one could use `pytorch-widedeep` to build a model almost identical to the one in that notebook. \n", + "\n", + "In a second notebook, we will show how one could use this library to implement other models, still following the same problem formulation."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "ebd9980d", + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "import warnings\n", + "\n", + "import pandas as pd\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "from pytorch_widedeep.datasets import load_movielens100k" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "7cd76bce", + "metadata": {}, + "outputs": [], + "source": [ + "warnings.filterwarnings(\"ignore\")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "0aed611e", + "metadata": {}, + "outputs": [], + "source": [ + "save_path = Path(\"prepared_data\")\n", + "if not save_path.exists():\n", + " save_path.mkdir(parents=True, exist_ok=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "5de7a941", + "metadata": {}, + "outputs": [], + "source": [ + "data, users, items = load_movielens100k(as_frame=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "7a288aee", + "metadata": {}, + "outputs": [], + "source": [ + "# Alternatively, as specified in the docs: 'The last 19 fields are the genres' so:\n", + "# list_of_genres = items.columns.tolist()[-19:]\n", + "list_of_genres = [\n", + " \"unknown\",\n", + " \"Action\",\n", + " \"Adventure\",\n", + " \"Animation\",\n", + " \"Children's\",\n", + " \"Comedy\",\n", + " \"Crime\",\n", + " \"Documentary\",\n", + " \"Drama\",\n", + " \"Fantasy\",\n", + " \"Film-Noir\",\n", + " \"Horror\",\n", + " \"Musical\",\n", + " \"Mystery\",\n", + " \"Romance\",\n", + " \"Sci-Fi\",\n", + " \"Thriller\",\n", + " \"War\",\n", + " \"Western\",\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "929a9712", + "metadata": {}, + "source": [ + "Let's first start by loading the interactions, user and item data" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "f4c09273", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idmovie_idratingtimestamp
01962423881250949
11863023891717742
2223771878887116
3244512880606923
41663461886397596
\n", + "
" + ], + "text/plain": [ + " user_id movie_id rating timestamp\n", + "0 196 242 3 881250949\n", + "1 186 302 3 891717742\n", + "2 22 377 1 878887116\n", + "3 244 51 2 880606923\n", + "4 166 346 1 886397596" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "18c3faa0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idagegenderoccupationzip_code
0124Mtechnician85711
1253Fother94043
2323Mwriter32067
3424Mtechnician43537
4533Fother15213
\n", + "
" + ], + "text/plain": [ + " user_id age gender occupation zip_code\n", + "0 1 24 M technician 85711\n", + "1 2 53 F other 94043\n", + "2 3 23 M writer 32067\n", + "3 4 24 M technician 43537\n", + "4 5 33 F other 15213" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "users.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "1dbad7b1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
movie_idmovie_titlerelease_datevideo_release_dateIMDb_URLunknownActionAdventureAnimationChildren's...FantasyFilm-NoirHorrorMusicalMysteryRomanceSci-FiThrillerWarWestern
01Toy Story (1995)01-Jan-1995NaNhttp://us.imdb.com/M/title-exact?Toy%20Story%2...00011...0000000000
12GoldenEye (1995)01-Jan-1995NaNhttp://us.imdb.com/M/title-exact?GoldenEye%20(...01100...0000000100
23Four Rooms (1995)01-Jan-1995NaNhttp://us.imdb.com/M/title-exact?Four%20Rooms%...00000...0000000100
34Get Shorty (1995)01-Jan-1995NaNhttp://us.imdb.com/M/title-exact?Get%20Shorty%...01000...0000000000
45Copycat (1995)01-Jan-1995NaNhttp://us.imdb.com/M/title-exact?Copycat%20(1995)00000...0000000100
\n", + "

5 rows × 24 columns

\n", + "
" + ], + "text/plain": [ + " movie_id movie_title release_date video_release_date \\\n", + "0 1 Toy Story (1995) 01-Jan-1995 NaN \n", + "1 2 GoldenEye (1995) 01-Jan-1995 NaN \n", + "2 3 Four Rooms (1995) 01-Jan-1995 NaN \n", + "3 4 Get Shorty (1995) 01-Jan-1995 NaN \n", + "4 5 Copycat (1995) 01-Jan-1995 NaN \n", + "\n", + " IMDb_URL unknown Action \\\n", + "0 http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 \n", + "1 http://us.imdb.com/M/title-exact?GoldenEye%20(... 0 1 \n", + "2 http://us.imdb.com/M/title-exact?Four%20Rooms%... 0 0 \n", + "3 http://us.imdb.com/M/title-exact?Get%20Shorty%... 0 1 \n", + "4 http://us.imdb.com/M/title-exact?Copycat%20(1995) 0 0 \n", + "\n", + " Adventure Animation Children's ... Fantasy Film-Noir Horror Musical \\\n", + "0 0 1 1 ... 0 0 0 0 \n", + "1 1 0 0 ... 0 0 0 0 \n", + "2 0 0 0 ... 0 0 0 0 \n", + "3 0 0 0 ... 0 0 0 0 \n", + "4 0 0 0 ... 0 0 0 0 \n", + "\n", + " Mystery Romance Sci-Fi Thriller War Western \n", + "0 0 0 0 0 0 0 \n", + "1 0 0 0 1 0 0 \n", + "2 0 0 0 1 0 0 \n", + "3 0 0 0 0 0 0 \n", + "4 0 0 0 1 0 0 \n", + "\n", + "[5 rows x 24 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "items.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "3cb7bbc5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idmovie_idratingtimestampnum_watched
0116858749654781
1117258749654782
2116558749655183
3115648749655564
4119658749656775
\n", + "
" + ], + "text/plain": [ + " user_id movie_id rating timestamp num_watched\n", + "0 1 168 5 874965478 1\n", + "1 1 172 5 874965478 2\n", + "2 1 165 5 874965518 3\n", + "3 1 156 4 874965556 4\n", + "4 1 196 5 874965677 5" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# adding a column with the number of movies watched per user\n", + "dataset = data.sort_values([\"user_id\", \"timestamp\"]).reset_index(drop=True)\n", + "dataset[\"one\"] = 1\n", + "dataset[\"num_watched\"] = dataset.groupby(\"user_id\")[\"one\"].cumsum()\n", + "dataset.drop(\"one\", axis=1, inplace=True)\n", + "dataset.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "cf7c5da2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idmovie_idratingtimestampnum_watchedmean_rate
01168587496547815.00
11172587496547825.00
21165587496551835.00
31156487496555644.75
41196587496567754.80
\n", + "
" + ], + "text/plain": [ + " user_id movie_id rating timestamp num_watched mean_rate\n", + "0 1 168 5 874965478 1 5.00\n", + "1 1 172 5 874965478 2 5.00\n", + "2 1 165 5 874965518 3 5.00\n", + "3 1 156 4 874965556 4 4.75\n", + "4 1 196 5 874965677 5 4.80" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# adding a column with the mean rating at a point in time per user\n", + "dataset[\"mean_rate\"] = (\n", + " dataset.groupby(\"user_id\")[\"rating\"].cumsum() / dataset[\"num_watched\"]\n", + ")\n", + "dataset.head()" + ] + }, + { + "cell_type": "markdown", + "id": "29d1c399", + "metadata": {}, + "source": [ + "### Problem formulation\n", + "\n", + "In this particular exercise the problem is formulated as predicting the next movie that will be watched (in consequence the last interactions will be discarded)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "0e9d1315", + "metadata": {}, + "outputs": [], + "source": [ + "dataset[\"target\"] = dataset.groupby(\"user_id\")[\"movie_id\"].shift(-1)" + ] + }, + { + "cell_type": "markdown", + "id": "b38bba10", + "metadata": {}, + "source": [ + "Following the same processing used by the author in the before-mentioned Kaggle notebook, we build sequences of previous movies watched" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "f001f2b4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idmovie_idratingtimestampnum_watchedmean_ratetargetprev_movies
01168587496547815.00172.0[168]
11172587496547825.00165.0[168, 172]
21165587496551835.00156.0[168, 172, 165]
31156487496555644.75196.0[168, 172, 165, 156]
41196587496567754.80166.0[168, 172, 165, 156, 196]
\n", + "
" + ], + "text/plain": [ + " user_id movie_id rating timestamp num_watched mean_rate target \\\n", + "0 1 168 5 874965478 1 5.00 172.0 \n", + "1 1 172 5 874965478 2 5.00 165.0 \n", + "2 1 165 5 874965518 3 5.00 156.0 \n", + "3 1 156 4 874965556 4 4.75 196.0 \n", + "4 1 196 5 874965677 5 4.80 166.0 \n", + "\n", + " prev_movies \n", + "0 [168] \n", + "1 [168, 172] \n", + "2 [168, 172, 165] \n", + "3 [168, 172, 165, 156] \n", + "4 [168, 172, 165, 156, 196] " + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Here the author builds the sequences\n", + "dataset[\"prev_movies\"] = dataset[\"movie_id\"].apply(lambda x: str(x))\n", + "dataset[\"prev_movies\"] = (\n", + " dataset.groupby(\"user_id\")[\"prev_movies\"]\n", + " .apply(lambda x: (x + \" \").cumsum().str.strip())\n", + " .reset_index(drop=True)\n", + ")\n", + "dataset[\"prev_movies\"] = dataset[\"prev_movies\"].apply(lambda x: x.split())\n", + "dataset.head()" + ] + }, + { + "cell_type": "markdown", + "id": "a024b9c4", + "metadata": {}, + "source": [ + "And now we add a `genre_rate` as the mean of all movies rated for a given genre per user\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "5782f0c9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idmovie_idratingtimestampnum_watchedmean_ratetargetprev_moviesunknownAction...Fantasy_rateFilm-Noir_rateHorror_rateMusical_rateMystery_rateRomance_rateSci-Fi_rateThriller_rateWar_rateWestern_rate
01168587496547815.00172.0[168]0.00.000000...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
11172587496547825.00165.0[168, 172]0.00.500000...NaNNaNNaNNaNNaN5.05.0NaN5.0NaN
21165587496551835.00156.0[168, 172, 165]0.00.333333...NaNNaNNaNNaNNaN5.05.0NaN5.0NaN
31156487496555644.75196.0[168, 172, 165, 156]0.00.250000...NaNNaNNaNNaNNaN5.05.04.05.0NaN
41196587496567754.80166.0[168, 172, 165, 156, 196]0.00.200000...NaNNaNNaNNaNNaN5.05.04.05.0NaN
\n", + "

5 rows × 46 columns

\n", + "
" + ], + "text/plain": [ + " user_id movie_id rating timestamp num_watched mean_rate target \\\n", + "0 1 168 5 874965478 1 5.00 172.0 \n", + "1 1 172 5 874965478 2 5.00 165.0 \n", + "2 1 165 5 874965518 3 5.00 156.0 \n", + "3 1 156 4 874965556 4 4.75 196.0 \n", + "4 1 196 5 874965677 5 4.80 166.0 \n", + "\n", + " prev_movies unknown Action ... Fantasy_rate \\\n", + "0 [168] 0.0 0.000000 ... NaN \n", + "1 [168, 172] 0.0 0.500000 ... NaN \n", + "2 [168, 172, 165] 0.0 0.333333 ... NaN \n", + "3 [168, 172, 165, 156] 0.0 0.250000 ... NaN \n", + "4 [168, 172, 165, 156, 196] 0.0 0.200000 ... NaN \n", + "\n", + " Film-Noir_rate Horror_rate Musical_rate Mystery_rate Romance_rate \\\n", + "0 NaN NaN NaN NaN NaN \n", + "1 NaN NaN NaN NaN 5.0 \n", + "2 NaN NaN NaN NaN 5.0 \n", + "3 NaN NaN NaN NaN 5.0 \n", + "4 NaN NaN NaN NaN 5.0 \n", + "\n", + " Sci-Fi_rate Thriller_rate War_rate Western_rate \n", + "0 NaN NaN NaN NaN \n", + "1 5.0 NaN 5.0 NaN \n", + "2 5.0 NaN 5.0 NaN \n", + "3 5.0 4.0 5.0 NaN \n", + "4 5.0 4.0 5.0 NaN \n", + "\n", + "[5 rows x 46 columns]" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataset = dataset.merge(items[[\"movie_id\"] + list_of_genres], on=\"movie_id\", how=\"left\")\n", + "for genre in list_of_genres:\n", + " dataset[f\"{genre}_rate\"] = dataset[genre] * dataset[\"rating\"]\n", + " dataset[genre] = dataset.groupby(\"user_id\")[genre].cumsum()\n", + " dataset[f\"{genre}_rate\"] = (\n", + " dataset.groupby(\"user_id\")[f\"{genre}_rate\"].cumsum() / dataset[genre]\n", + " )\n", + "dataset[list_of_genres] = dataset[list_of_genres].apply(\n", + " lambda x: x / dataset[\"num_watched\"]\n", + ")\n", + "dataset.head()" + ] + }, + { + "cell_type": "markdown", + "id": "7029510d", + "metadata": {}, + "source": [ + "Adding user features" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "df698ec8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idmovie_idratingtimestampnum_watchedmean_ratetargetprev_moviesunknownAction...Mystery_rateRomance_rateSci-Fi_rateThriller_rateWar_rateWestern_rateagegenderoccupationzip_code
01168587496547815.00172.0[168]0.00.000000...NaNNaNNaNNaNNaNNaN24Mtechnician85711
11172587496547825.00165.0[168, 172]0.00.500000...NaN5.05.0NaN5.0NaN24Mtechnician85711
21165587496551835.00156.0[168, 172, 165]0.00.333333...NaN5.05.0NaN5.0NaN24Mtechnician85711
31156487496555644.75196.0[168, 172, 165, 156]0.00.250000...NaN5.05.04.05.0NaN24Mtechnician85711
41196587496567754.80166.0[168, 172, 165, 156, 196]0.00.200000...NaN5.05.04.05.0NaN24Mtechnician85711
\n", + "

5 rows × 50 columns

\n", + "
" + ], + "text/plain": [ + " user_id movie_id rating timestamp num_watched mean_rate target \\\n", + "0 1 168 5 874965478 1 5.00 172.0 \n", + "1 1 172 5 874965478 2 5.00 165.0 \n", + "2 1 165 5 874965518 3 5.00 156.0 \n", + "3 1 156 4 874965556 4 4.75 196.0 \n", + "4 1 196 5 874965677 5 4.80 166.0 \n", + "\n", + " prev_movies unknown Action ... Mystery_rate \\\n", + "0 [168] 0.0 0.000000 ... NaN \n", + "1 [168, 172] 0.0 0.500000 ... NaN \n", + "2 [168, 172, 165] 0.0 0.333333 ... NaN \n", + "3 [168, 172, 165, 156] 0.0 0.250000 ... NaN \n", + "4 [168, 172, 165, 156, 196] 0.0 0.200000 ... NaN \n", + "\n", + " Romance_rate Sci-Fi_rate Thriller_rate War_rate Western_rate age \\\n", + "0 NaN NaN NaN NaN NaN 24 \n", + "1 5.0 5.0 NaN 5.0 NaN 24 \n", + "2 5.0 5.0 NaN 5.0 NaN 24 \n", + "3 5.0 5.0 4.0 5.0 NaN 24 \n", + "4 5.0 5.0 4.0 5.0 NaN 24 \n", + "\n", + " gender occupation zip_code \n", + "0 M technician 85711 \n", + "1 M technician 85711 \n", + "2 M technician 85711 \n", + "3 M technician 85711 \n", + "4 M technician 85711 \n", + "\n", + "[5 rows x 50 columns]" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataset = dataset.merge(users, on=\"user_id\", how=\"left\")\n", + "dataset.head()" + ] + }, + { + "cell_type": "markdown", + "id": "ee62d77e", + "metadata": {}, + "source": [ + "Again, we use the same settings as those in the Kaggle notebook, but `COLD_START_TRESH` is pretty aggressive" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "8060cf59", + "metadata": {}, + "outputs": [], + "source": [ + "COLD_START_TRESH = 5\n", + "\n", + "filtred_data = dataset[\n", + " (dataset[\"num_watched\"] >= COLD_START_TRESH) & ~(dataset[\"target\"].isna())\n", + "].sort_values(\"timestamp\")\n", + "train_data, _test_data = train_test_split(filtred_data, test_size=0.2, shuffle=False)\n", + "valid_data, test_data = train_test_split(_test_data, test_size=0.5, shuffle=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "b1beb347", + "metadata": {}, + "outputs": [], + "source": [ + "cols_to_drop = [\n", + " # \"rating\",\n", + " \"timestamp\",\n", + " \"num_watched\",\n", + "]\n", + "\n", + "df_train = train_data.drop(cols_to_drop, axis=1)\n", + "df_valid = valid_data.drop(cols_to_drop, axis=1)\n", + "df_test = test_data.drop(cols_to_drop, axis=1)\n", + "\n", + "df_train.to_pickle(save_path / \"df_train.pkl\")\n", + "df_valid.to_pickle(save_path / \"df_valid.pkl\")\n", + "df_test.to_pickle(save_path / \"df_test.pkl\")" + ] + }, + { + "cell_type": "markdown", + "id": "5bf71a82", + "metadata": {}, + "source": [ + "Let's now build a model that is nearly identical to the one use in the[ Kaggle notebook](https://www.kaggle.com/code/matanivanov/wide-deep-learning-for-recsys-with-pytorch)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "6aa2e3f2", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import torch\n", + "from torch import nn\n", + "from scipy.sparse import coo_matrix\n", + "\n", + "from pytorch_widedeep import Trainer\n", + "from pytorch_widedeep.models import TabMlp, BasicRNN, WideDeep\n", + "from pytorch_widedeep.preprocessing import TabPreprocessor" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "42b0d88f", + "metadata": {}, + "outputs": [], + "source": [ + "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", + "\n", + "save_path = Path(\"prepared_data\")\n", + "\n", + "PAD_IDX = 0" + ] + }, + { + "cell_type": 
"markdown", + "id": "be204fe8", + "metadata": {}, + "source": [ + "Let's use some of the functions the author of the kaggle's notebook uses to prepare the data" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "206eb90e", + "metadata": {}, + "outputs": [], + "source": [ + "def get_coo_indexes(lil):\n", + " rows = []\n", + " cols = []\n", + " for i, el in enumerate(lil):\n", + " if type(el) != list:\n", + " el = [el]\n", + " for j in el:\n", + " rows.append(i)\n", + " cols.append(j)\n", + " return rows, cols\n", + "\n", + "\n", + "def get_sparse_features(series, shape):\n", + " coo_indexes = get_coo_indexes(series.tolist())\n", + " sparse_df = coo_matrix(\n", + " (np.ones(len(coo_indexes[0])), (coo_indexes[0], coo_indexes[1])), shape=shape\n", + " )\n", + " return sparse_df\n", + "\n", + "\n", + "def sparse_to_idx(data, pad_idx=-1):\n", + " indexes = data.nonzero()\n", + " indexes_df = pd.DataFrame()\n", + " indexes_df[\"rows\"] = indexes[0]\n", + " indexes_df[\"cols\"] = indexes[1]\n", + " mdf = indexes_df.groupby(\"rows\").apply(lambda x: x[\"cols\"].tolist())\n", + " max_len = mdf.apply(lambda x: len(x)).max()\n", + " return mdf.apply(lambda x: pd.Series(x + [pad_idx] * (max_len - len(x)))).values" + ] + }, + { + "cell_type": "markdown", + "id": "7ca8dd42", + "metadata": {}, + "source": [ + "For the time being, we will not use a validation set for hyperparameter optimization, and we will simply concatenate the validation and the test set in one test set. I simply splitted the data into train/valid/test in case the reader wants to actually do hyperparameter optimization (and because I know in the future I will).\n", + "\n", + "There is also another caveat worth mentioning, related to the indexing of the movies. To build the matrices of movies watched, we use the entire dataset. A more realistic (and correct) approach would be to use ONLY the movies that appear in the training set and consider `unknown` or `unseen` those in the testing set that have not been seen during training. Nonetheless, this will not affect the purposes of this notebook, which is to illustrate how one could use `pytorch-widedeep` to build a recommendation algorithm. However, if one wanted to explore the performance of different algorithms in a \"proper\" way, these \"details\" need to be accounted for." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "39f778bc", + "metadata": {}, + "outputs": [], + "source": [ + "df_test = pd.concat([df_valid, df_test], ignore_index=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "ab7483c3", + "metadata": {}, + "outputs": [], + "source": [ + "id_cols = [\"user_id\", \"movie_id\"]\n", + "max_movie_index = max(df_train.movie_id.max(), df_test.movie_id.max())" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "3d17bd3d", + "metadata": {}, + "outputs": [], + "source": [ + "X_train = df_train.drop(id_cols + [\"rating\", \"prev_movies\", \"target\"], axis=1)\n", + "y_train = np.array(df_train.target.values, dtype=\"int64\")\n", + "train_movies_watched = get_sparse_features(\n", + "    df_train[\"prev_movies\"], (len(df_train), max_movie_index + 1)\n", + ")\n", + "\n", + "X_test = df_test.drop(id_cols + [\"rating\", \"prev_movies\", \"target\"], axis=1)\n", + "y_test = np.array(df_test.target.values, dtype=\"int64\")\n", + "test_movies_watched = get_sparse_features(\n", + "    df_test[\"prev_movies\"], (len(df_test), max_movie_index + 1)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "511e95ed", + "metadata": {}, + "source": [ + "Let's have a look at the information in each dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "dd9e5ef3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
mean_rateunknownActionAdventureAnimationChildren'sComedyCrimeDocumentaryDrama...Mystery_rateRomance_rateSci-Fi_rateThriller_rateWar_rateWestern_rateagegenderoccupationzip_code
254234.0000000.00.4000000.2000000.00.00.4000000.00.00.200000...NaN4.04.04.0000004.0NaN21Mstudent48823
254254.0000000.00.2857140.1428570.00.00.4285710.00.00.285714...NaN4.04.04.0000004.0NaN21Mstudent48823
254244.0000000.00.3333330.1666670.00.00.3333330.00.00.333333...NaN4.04.04.0000004.0NaN21Mstudent48823
254263.8750000.00.2500000.1250000.00.00.3750000.00.00.250000...NaN4.04.03.6666674.0NaN21Mstudent48823
254273.8888890.00.2222220.1111110.00.00.3333330.00.00.333333...NaN4.04.03.6666674.0NaN21Mstudent48823
\n", + "

5 rows × 43 columns

\n", + "
" + ], + "text/plain": [ + " mean_rate unknown Action Adventure Animation Children's \\\n", + "25423 4.000000 0.0 0.400000 0.200000 0.0 0.0 \n", + "25425 4.000000 0.0 0.285714 0.142857 0.0 0.0 \n", + "25424 4.000000 0.0 0.333333 0.166667 0.0 0.0 \n", + "25426 3.875000 0.0 0.250000 0.125000 0.0 0.0 \n", + "25427 3.888889 0.0 0.222222 0.111111 0.0 0.0 \n", + "\n", + " Comedy Crime Documentary Drama ... Mystery_rate \\\n", + "25423 0.400000 0.0 0.0 0.200000 ... NaN \n", + "25425 0.428571 0.0 0.0 0.285714 ... NaN \n", + "25424 0.333333 0.0 0.0 0.333333 ... NaN \n", + "25426 0.375000 0.0 0.0 0.250000 ... NaN \n", + "25427 0.333333 0.0 0.0 0.333333 ... NaN \n", + "\n", + " Romance_rate Sci-Fi_rate Thriller_rate War_rate Western_rate age \\\n", + "25423 4.0 4.0 4.000000 4.0 NaN 21 \n", + "25425 4.0 4.0 4.000000 4.0 NaN 21 \n", + "25424 4.0 4.0 4.000000 4.0 NaN 21 \n", + "25426 4.0 4.0 3.666667 4.0 NaN 21 \n", + "25427 4.0 4.0 3.666667 4.0 NaN 21 \n", + "\n", + " gender occupation zip_code \n", + "25423 M student 48823 \n", + "25425 M student 48823 \n", + "25424 M student 48823 \n", + "25426 M student 48823 \n", + "25427 M student 48823 \n", + "\n", + "[5 rows x 43 columns]" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "X_train.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "840e59a2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([772, 288, 108, ..., 183, 432, 509])" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y_train" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "516d2fd5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<76228x1683 sparse matrix of type ''\n", + "\twith 7957390 stored elements in COOrdinate format>" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_movies_watched" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "a4cba74d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['173', '185', '255', '286', '298']" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sorted(df_train.prev_movies.tolist()[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "a4f11af4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([0, 0, 0, 0, 0]), array([173, 185, 255, 286, 298]))" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.where(train_movies_watched.todense()[0])" + ] + }, + { + "cell_type": "markdown", + "id": "2d7dd7bc", + "metadata": {}, + "source": [ + "And from now on is when the specifics related to this library start to appear. The only component that is going to be a bit different is the so-called tabular component, referred as `continuous` in the notebook. \n", + "\n", + "In the case of `pytorch-widedeep` we have the `TabPreprocessor` that allows for a lot of flexibility as to how we would like to process the tabular component of this Wide and Deep model. 
In other words, here our tabular component is a bit more elaborate than the one in the notebook, just a bit...\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "733ea2a5", + "metadata": {}, + "outputs": [], + "source": [ + "cat_cols = [\"gender\", \"occupation\", \"zip_code\"]\n", + "cont_cols = [c for c in X_train if c not in cat_cols]\n", + "tab_preprocessor = TabPreprocessor(\n", + "    cat_embed_cols=cat_cols,\n", + "    continuous_cols=cont_cols,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "68555183", + "metadata": {}, + "outputs": [], + "source": [ + "X_train_tab = tab_preprocessor.fit_transform(X_train.fillna(0))\n", + "X_test_tab = tab_preprocessor.transform(X_test.fillna(0))" + ] + }, + { + "cell_type": "markdown", + "id": "a00da28c", + "metadata": {}, + "source": [ + "Now, in the notebook, the author moves the sparse matrices to sparse tensors and then turns them into dense tensors. In reality, this is not necessary; one could feed sparse tensors to `nn.Linear` layers in pytorch. Nonetheless, this is not the most efficient implementation, and it is the reason why in our library the wide, linear component is implemented as an embedding layer. \n", + "\n", + "Even so, to reproduce the notebook as best we can, and because currently the `Wide` model in `pytorch-widedeep` is not designed to receive sparse tensors (we might consider implementing this functionality), we will turn the sparse COO matrices into dense arrays. We will then code a fairly simple, custom `Wide` component." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "20903dd2", + "metadata": {}, + "outputs": [], + "source": [ + "X_train_wide = np.array(train_movies_watched.todense())\n", + "X_test_wide = np.array(test_movies_watched.todense())" + ] + }, + { + "cell_type": "markdown", + "id": "377e7f90", + "metadata": {}, + "source": [ + "Finally, the author of the notebook uses a simple `Embedding` layer to encode the sequences of movies watched, the `prev_movies` columns. In my opinion, there is an element of information redundancy here. This is because the wide and text components implicitly carry the same information, just in different forms. Moreover, both of the models used for these two components ignore the sequential element in the data. Nonetheless, we want to reproduce the Kaggle notebook as closely as possible, AND, as one can explore later (by performing simple ablation studies), the wide component seems to carry most of the predictive power." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "c52fd52c", + "metadata": {}, + "outputs": [], + "source": [ + "X_train_text = sparse_to_idx(train_movies_watched, pad_idx=PAD_IDX)\n", + "X_test_text = sparse_to_idx(test_movies_watched, pad_idx=PAD_IDX)" + ] + }, + { + "cell_type": "markdown", + "id": "1ca8b84d", + "metadata": {}, + "source": [ + "Let's now build the models" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "44bc73d4", + "metadata": {}, + "outputs": [], + "source": [ + "class Wide(nn.Module):\n", + "    def __init__(self, input_dim: int, pred_dim: int):\n", + "        super().__init__()\n", + "\n", + "        self.input_dim = input_dim\n", + "        self.pred_dim = pred_dim\n", + "\n", + "        # When I coded the library I never thought that someone would want to code\n", + "        # their own wide component. However, if you do, the wide component must have\n", + "        # a 'wide_linear' attribute. 
In other words, the linear layer must be\n", + "        # called 'wide_linear'\n", + "        self.wide_linear = nn.Linear(input_dim, pred_dim)\n", + "\n", + "    def forward(self, X):\n", + "        out = self.wide_linear(X.type(torch.float32))\n", + "        return out\n", + "\n", + "\n", + "wide = Wide(X_train_wide.shape[1], max_movie_index + 1)" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "6f66130d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Wide(\n", + "  (wide_linear): Linear(in_features=1683, out_features=1683, bias=True)\n", + ")" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "wide" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "25592d30", + "metadata": {}, + "outputs": [], + "source": [ + "class SimpleEmbed(nn.Module):\n", + "    def __init__(self, vocab_size: int, embed_dim: int, pad_idx: int):\n", + "        super().__init__()\n", + "\n", + "        self.vocab_size = vocab_size\n", + "        self.embed_dim = embed_dim\n", + "        self.pad_idx = pad_idx\n", + "\n", + "        # The sequences of movies watched are simply embedded in the Kaggle\n", + "        # notebook. No RNN, Transformer or any other model is used\n", + "        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)\n", + "\n", + "    def forward(self, X):\n", + "        embed = self.embed(X)\n", + "        embed_mean = torch.mean(embed, dim=1)\n", + "        return embed_mean\n", + "\n", + "    @property\n", + "    def output_dim(self) -> int:\n", + "        # All deep components in a custom 'pytorch-widedeep' model must have\n", + "        # an output_dim property\n", + "        return self.embed_dim\n", + "\n", + "\n", + "# In the notebook the author simply uses embeddings\n", + "simple_embed = SimpleEmbed(max_movie_index + 1, 16, 0)" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "492f12c5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "SimpleEmbed(\n", + "  (embed): Embedding(1683, 16, padding_idx=0)\n", + ")" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "simple_embed" + ] + }, + { + "cell_type": "markdown", + "id": "fe9f137a", + "metadata": {}, + "source": [ + "Maybe one would like to use an RNN to account for the sequential nature of the problem. If that were the case, it would be as easy as: " + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "0c3f17b2", + "metadata": {}, + "outputs": [], + "source": [ + "basic_rnn = BasicRNN(\n", + "    vocab_size=max_movie_index + 1,\n", + "    embed_dim=16,\n", + "    hidden_dim=32,\n", + "    n_layers=2,\n", + "    rnn_type=\"gru\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e410d5d9", + "metadata": {}, + "source": [ + "And finally, the tabular component, which in the notebook is simply a stack of linear + ReLU layers. 
In our case we have an embedding layer before the linear layers to encode the categorical and numerical columns" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "ca721555", + "metadata": {}, + "outputs": [], + "source": [ + "tab_mlp = TabMlp(\n", + "    column_idx=tab_preprocessor.column_idx,\n", + "    cat_embed_input=tab_preprocessor.cat_embed_input,\n", + "    continuous_cols=tab_preprocessor.continuous_cols,\n", + "    cont_norm_layer=None,\n", + "    mlp_hidden_dims=[1024, 512, 256],\n", + "    mlp_activation=\"relu\",\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "25c25e3a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "TabMlp(\n", + "  (cat_and_cont_embed): DiffSizeCatAndContEmbeddings(\n", + "    (cat_embed): DiffSizeCatEmbeddings(\n", + "      (embed_layers): ModuleDict(\n", + "        (emb_layer_gender): Embedding(3, 2, padding_idx=0)\n", + "        (emb_layer_occupation): Embedding(22, 9, padding_idx=0)\n", + "        (emb_layer_zip_code): Embedding(648, 60, padding_idx=0)\n", + "      )\n", + "      (embedding_dropout): Dropout(p=0.1, inplace=False)\n", + "    )\n", + "    (cont_norm): Identity()\n", + "  )\n", + "  (encoder): MLP(\n", + "    (mlp): Sequential(\n", + "      (dense_layer_0): Sequential(\n", + "        (0): Dropout(p=0.1, inplace=False)\n", + "        (1): Linear(in_features=111, out_features=1024, bias=True)\n", + "        (2): ReLU(inplace=True)\n", + "      )\n", + "      (dense_layer_1): Sequential(\n", + "        (0): Dropout(p=0.1, inplace=False)\n", + "        (1): Linear(in_features=1024, out_features=512, bias=True)\n", + "        (2): ReLU(inplace=True)\n", + "      )\n", + "      (dense_layer_2): Sequential(\n", + "        (0): Dropout(p=0.1, inplace=False)\n", + "        (1): Linear(in_features=512, out_features=256, bias=True)\n", + "        (2): ReLU(inplace=True)\n", + "      )\n", + "    )\n", + "  )\n", + ")" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tab_mlp" + ] + }, + { + "cell_type": "markdown", + "id": "b68c5bc9", + "metadata": {}, + "source": [ + "Finally, we simply wrap up all models with the `WideDeep` 'collector' class and we are ready to train. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "4c6acc08", + "metadata": {}, + "outputs": [], + "source": [ + "wide_deep_model = WideDeep(\n", + " wide=wide, deeptabular=tab_mlp, deeptext=simple_embed, pred_dim=max_movie_index + 1\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "bc8970f7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "WideDeep(\n", + " (wide): Wide(\n", + " (wide_linear): Linear(in_features=1683, out_features=1683, bias=True)\n", + " )\n", + " (deeptabular): Sequential(\n", + " (0): TabMlp(\n", + " (cat_and_cont_embed): DiffSizeCatAndContEmbeddings(\n", + " (cat_embed): DiffSizeCatEmbeddings(\n", + " (embed_layers): ModuleDict(\n", + " (emb_layer_gender): Embedding(3, 2, padding_idx=0)\n", + " (emb_layer_occupation): Embedding(22, 9, padding_idx=0)\n", + " (emb_layer_zip_code): Embedding(648, 60, padding_idx=0)\n", + " )\n", + " (embedding_dropout): Dropout(p=0.1, inplace=False)\n", + " )\n", + " (cont_norm): Identity()\n", + " )\n", + " (encoder): MLP(\n", + " (mlp): Sequential(\n", + " (dense_layer_0): Sequential(\n", + " (0): Dropout(p=0.1, inplace=False)\n", + " (1): Linear(in_features=111, out_features=1024, bias=True)\n", + " (2): ReLU(inplace=True)\n", + " )\n", + " (dense_layer_1): Sequential(\n", + " (0): Dropout(p=0.1, inplace=False)\n", + " (1): Linear(in_features=1024, out_features=512, bias=True)\n", + " (2): ReLU(inplace=True)\n", + " )\n", + " (dense_layer_2): Sequential(\n", + " (0): Dropout(p=0.1, inplace=False)\n", + " (1): Linear(in_features=512, out_features=256, bias=True)\n", + " (2): ReLU(inplace=True)\n", + " )\n", + " )\n", + " )\n", + " )\n", + " (1): Linear(in_features=256, out_features=1683, bias=True)\n", + " )\n", + " (deeptext): Sequential(\n", + " (0): SimpleEmbed(\n", + " (embed): Embedding(1683, 16, padding_idx=0)\n", + " )\n", + " (1): Linear(in_features=16, out_features=1683, bias=True)\n", + " )\n", + ")" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "wide_deep_model" + ] + }, + { + "cell_type": "markdown", + "id": "e08d41ed", + "metadata": {}, + "source": [ + "Note that the main difference between this wide and deep model and the Wide and Deep model in the Kaggle notebook is that in that notebook, the author concatenates the embedings and the tabular features, then passes this concatenation through a stack of linear + Relu layers with a final output dim of 256. Then concatenates this output with the binary features and connects this concatenation with the final linear layer (so the final weights are of dim (batch_size, 256 + 1683)). Our implementation follows the notation of the original paper and instead of concatenating the tabular, text and wide components and then connect them to the output neurons, we first compute their output, and then add it (see here: https://arxiv.org/pdf/1606.07792.pdf, their Eq 3). Note that this is effectively the same, with the caveat that while in one case one initialises a big weight matrix \"at once\", in our implementation we initialise different matrices for different components. Anyway, let's give it a go." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "538a34de", + "metadata": {}, + "outputs": [], + "source": [ + "trainer = Trainer(\n", + " model=wide_deep_model,\n", + " objective=\"multiclass\",\n", + " custom_loss_function=nn.CrossEntropyLoss(ignore_index=PAD_IDX),\n", + " optimizers=torch.optim.Adam(wide_deep_model.parameters(), lr=1e-3),\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "77c02ed5", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "epoch 1: 100%|█████████████████████████████████████████████████████████████████████████| 149/149 [00:16<00:00, 8.82it/s, loss=6.66]\n", + "valid: 100%|█████████████████████████████████████████████████████████████████████████████| 38/38 [00:01<00:00, 21.14it/s, loss=6.61]\n", + "epoch 2: 100%|█████████████████████████████████████████████████████████████████████████| 149/149 [00:17<00:00, 8.53it/s, loss=5.98]\n", + "valid: 100%|█████████████████████████████████████████████████████████████████████████████| 38/38 [00:01<00:00, 23.20it/s, loss=6.53]\n", + "epoch 3: 100%|█████████████████████████████████████████████████████████████████████████| 149/149 [00:17<00:00, 8.61it/s, loss=5.66]\n", + "valid: 100%|█████████████████████████████████████████████████████████████████████████████| 38/38 [00:01<00:00, 23.16it/s, loss=6.54]\n", + "epoch 4: 100%|█████████████████████████████████████████████████████████████████████████| 149/149 [00:17<00:00, 8.76it/s, loss=5.43]\n", + "valid: 100%|█████████████████████████████████████████████████████████████████████████████| 38/38 [00:01<00:00, 22.03it/s, loss=6.56]\n", + "epoch 5: 100%|█████████████████████████████████████████████████████████████████████████| 149/149 [00:17<00:00, 8.28it/s, loss=5.23]\n", + "valid: 100%|█████████████████████████████████████████████████████████████████████████████| 38/38 [00:01<00:00, 22.60it/s, loss=6.59]\n" + ] + } + ], + "source": [ + "trainer.fit(\n", + " X_train={\n", + " \"X_wide\": X_train_wide,\n", + " \"X_tab\": X_train_tab,\n", + " \"X_text\": X_train_text,\n", + " \"target\": y_train,\n", + " },\n", + " X_val={\n", + " \"X_wide\": X_test_wide,\n", + " \"X_tab\": X_test_tab,\n", + " \"X_text\": X_test_text,\n", + " \"target\": y_test,\n", + " },\n", + " n_epochs=5,\n", + " batch_size=512,\n", + " shuffle=False,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a8f9aec7", + "metadata": {}, + "source": [ + "Now one could continue to the 'compare' metrics section of the Kaggle notebook. 
However, for the purposes of illustrating how one could use `pytorch-widedeep` to build recommendation algorithms, we consider this notebook completed and move on to part 2" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/notebooks/19_wide_and_deep_for_recsys_pt2.ipynb b/examples/notebooks/19_wide_and_deep_for_recsys_pt2.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..c5cfbc8d7ee1289a2e7c2b684e80477854776012 --- /dev/null +++ b/examples/notebooks/19_wide_and_deep_for_recsys_pt2.ipynb @@ -0,0 +1,368 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is the second of the two notebooks where we aim to illustrate how one could use this library to build recommendation algorithms using the example in this [Kaggle notebook](https://www.kaggle.com/code/matanivanov/wide-deep-learning-for-recsys-with-pytorch) as guidance. In the previous notebook we used `pytorch-widedeep` to build a model that replicated almost exactly the one in that notebook. In this second, shorter notebook we will show how one could use the library to explore other models, still following the same problem formulation, that is: given the state of a user who, at a certain point in time, has watched a series of movies, our goal is to predict which movie the user will watch next. \n", + "\n", + "Assuming that one has read (and run) the previous notebook, the required data will be stored in a local dir called `prepared_data`, so let's read it:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "import numpy as np\n", + "import torch\n", + "import pandas as pd\n", + "from torch import nn\n", + "\n", + "from pytorch_widedeep import Trainer\n", + "from pytorch_widedeep.utils import pad_sequences\n", + "from pytorch_widedeep.models import TabMlp, WideDeep, Transformer\n", + "from pytorch_widedeep.preprocessing import TabPreprocessor" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "save_path = Path(\"prepared_data\")\n", + "\n", + "PAD_IDX = 0\n", + "\n", + "id_cols = [\"user_id\", \"movie_id\"]\n", + "\n", + "df_train = pd.read_pickle(save_path / \"df_train.pkl\")\n", + "df_valid = pd.read_pickle(save_path / \"df_valid.pkl\")\n", + "df_test = pd.read_pickle(save_path / \"df_test.pkl\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "...remember that in the previous notebook we explained that we are not going to use a validation set here (in a real-world example, or simply a more realistic example, one should always use it).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "df_test = pd.concat([df_valid, df_test], ignore_index=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Also remember that in the previous notebook we discussed that the `'maxlen'` and `'max_movie_index'` parameters should be computed using only the train set. 
In particular, to properly do the tokenization, one would have to use ONLY train tokens and add a token for new 'unknown'/'unseen' movies in the test set. This can also be done with this library or manually, so I will leave it to the reader to implement that tokenization approach." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "maxlen = max(\n", + "    df_train.prev_movies.apply(lambda x: len(x)).max(),\n", + "    df_test.prev_movies.apply(lambda x: len(x)).max(),\n", + ")\n", + "\n", + "max_movie_index = max(df_train.movie_id.max(), df_test.movie_id.max())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From now on things are pretty simple, especially bearing in mind that in this example we are not going to use a wide component since, in principle, one would believe that the information in that component is also 'carried' by the movie sequences (however, as simple ablation studies on the previous notebook suggest, most of the predictive power comes from the linear, wide model).\n", + "\n", + "In the example here we are going to explore one (of many) possibilities. We are simply going to encode the triplet `(user, item, rating)` and use it as a `deeptabular` component, and the sequences of previously watched movies as the `deeptext` component. For the `deeptext` component we are going to use a basic encoder-only transformer model.\n", + "\n", + "Let's start with the tabular data preparation\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "df_train_user_item = df_train[[\"user_id\", \"movie_id\", \"rating\"]]\n", + "train_movies_sequences = df_train.prev_movies.apply(\n", + "    lambda x: [int(el) for el in x]\n", + ").to_list()\n", + "y_train = df_train.target.values.astype(int)\n", + "\n", + "df_test_user_item = df_test[[\"user_id\", \"movie_id\", \"rating\"]]\n", + "test_movies_sequences = df_test.prev_movies.apply(\n", + "    lambda x: [int(el) for el in x]\n", + ").to_list()\n", + "y_test = df_test.target.values.astype(int)\n", + "\n", + "tab_preprocessor = TabPreprocessor(\n", + "    cat_embed_cols=[\"user_id\", \"movie_id\", \"rating\"],\n", + ")\n", + "X_train_tab = tab_preprocessor.fit_transform(df_train_user_item)\n", + "X_test_tab = tab_preprocessor.transform(df_test_user_item)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And now the text component, simply padding the sequences:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "X_train_text = np.array(\n", + "    [\n", + "        pad_sequences(\n", + "            s,\n", + "            maxlen=maxlen,\n", + "            pad_first=False,\n", + "            pad_idx=PAD_IDX,\n", + "        )\n", + "        for s in train_movies_sequences\n", + "    ]\n", + ")\n", + "X_test_text = np.array(\n", + "    [\n", + "        pad_sequences(\n", + "            s,\n", + "            maxlen=maxlen,\n", + "            pad_first=False,\n", + "            pad_idx=PAD_IDX,\n", + "        )\n", + "        for s in test_movies_sequences\n", + "    ]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now define the model components and the wide and deep model.\n",
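+    "\n", +    "Before doing so, a quick sanity check on shapes (this snippet is mine and was not in the original notebook):\n", +    "\n", +    "```python\n", +    "# each padded sequence should have length maxlen, and the tabular, text\n", +    "# and target arrays should all have one row per training example\n", +    "assert X_train_text.shape == (len(df_train), maxlen)\n", +    "assert X_train_tab.shape[0] == X_train_text.shape[0] == y_train.shape[0]\n", +    "```"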
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "tab_mlp = TabMlp(\n", + " column_idx=tab_preprocessor.column_idx,\n", + " cat_embed_input=tab_preprocessor.cat_embed_input,\n", + " mlp_hidden_dims=[1024, 512, 256],\n", + " mlp_activation=\"relu\",\n", + ")\n", + "\n", + "# plenty of options here, see the docs\n", + "transformer = Transformer(\n", + " vocab_size=max_movie_index + 1,\n", + " embed_dim=32,\n", + " n_heads=2,\n", + " n_blocks=2,\n", + " seq_length=maxlen,\n", + ")\n", + "\n", + "wide_deep_model = WideDeep(\n", + " deeptabular=tab_mlp, deeptext=transformer, pred_dim=max_movie_index + 1\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "WideDeep(\n", + " (deeptabular): Sequential(\n", + " (0): TabMlp(\n", + " (cat_and_cont_embed): DiffSizeCatAndContEmbeddings(\n", + " (cat_embed): DiffSizeCatEmbeddings(\n", + " (embed_layers): ModuleDict(\n", + " (emb_layer_user_id): Embedding(749, 65, padding_idx=0)\n", + " (emb_layer_movie_id): Embedding(1612, 100, padding_idx=0)\n", + " (emb_layer_rating): Embedding(6, 4, padding_idx=0)\n", + " )\n", + " (embedding_dropout): Dropout(p=0.1, inplace=False)\n", + " )\n", + " )\n", + " (encoder): MLP(\n", + " (mlp): Sequential(\n", + " (dense_layer_0): Sequential(\n", + " (0): Dropout(p=0.1, inplace=False)\n", + " (1): Linear(in_features=169, out_features=1024, bias=True)\n", + " (2): ReLU(inplace=True)\n", + " )\n", + " (dense_layer_1): Sequential(\n", + " (0): Dropout(p=0.1, inplace=False)\n", + " (1): Linear(in_features=1024, out_features=512, bias=True)\n", + " (2): ReLU(inplace=True)\n", + " )\n", + " (dense_layer_2): Sequential(\n", + " (0): Dropout(p=0.1, inplace=False)\n", + " (1): Linear(in_features=512, out_features=256, bias=True)\n", + " (2): ReLU(inplace=True)\n", + " )\n", + " )\n", + " )\n", + " )\n", + " (1): Linear(in_features=256, out_features=1683, bias=True)\n", + " )\n", + " (deeptext): Sequential(\n", + " (0): Transformer(\n", + " (embedding): Embedding(1683, 32)\n", + " (pos_encoder): PositionalEncoding(\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " )\n", + " (encoder): Sequential(\n", + " (transformer_block0): TransformerEncoder(\n", + " (attn): MultiHeadedAttention(\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (q_proj): Linear(in_features=32, out_features=32, bias=False)\n", + " (kv_proj): Linear(in_features=32, out_features=64, bias=False)\n", + " (out_proj): Linear(in_features=32, out_features=32, bias=False)\n", + " )\n", + " (ff): FeedForward(\n", + " (w_1): Linear(in_features=32, out_features=128, bias=True)\n", + " (w_2): Linear(in_features=128, out_features=32, bias=True)\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (activation): GELU(approximate='none')\n", + " )\n", + " (attn_addnorm): AddNorm(\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)\n", + " )\n", + " (ff_addnorm): AddNorm(\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)\n", + " )\n", + " )\n", + " (transformer_block1): TransformerEncoder(\n", + " (attn): MultiHeadedAttention(\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (q_proj): Linear(in_features=32, out_features=32, bias=False)\n", + " (kv_proj): Linear(in_features=32, out_features=64, bias=False)\n", + " (out_proj): Linear(in_features=32, out_features=32, bias=False)\n", 
+ " )\n", + " (ff): FeedForward(\n", + " (w_1): Linear(in_features=32, out_features=128, bias=True)\n", + " (w_2): Linear(in_features=128, out_features=32, bias=True)\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (activation): GELU(approximate='none')\n", + " )\n", + " (attn_addnorm): AddNorm(\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)\n", + " )\n", + " (ff_addnorm): AddNorm(\n", + " (dropout): Dropout(p=0.1, inplace=False)\n", + " (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)\n", + " )\n", + " )\n", + " )\n", + " )\n", + " (1): Linear(in_features=23552, out_features=1683, bias=True)\n", + " )\n", + ")" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "wide_deep_model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And as in the previous notebook, let's train (you will need a GPU for this)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "trainer = Trainer(\n", + " model=wide_deep_model,\n", + " objective=\"multiclass\",\n", + " custom_loss_function=nn.CrossEntropyLoss(ignore_index=PAD_IDX),\n", + " optimizers=torch.optim.Adam(wide_deep_model.parameters(), lr=1e-3),\n", + ")\n", + "\n", + "trainer.fit(\n", + " X_train={\n", + " \"X_tab\": X_train_tab,\n", + " \"X_text\": X_train_text,\n", + " \"target\": y_train,\n", + " },\n", + " X_val={\n", + " \"X_tab\": X_test_tab,\n", + " \"X_text\": X_test_text,\n", + " \"target\": y_test,\n", + " },\n", + " n_epochs=10,\n", + " batch_size=521,\n", + " shuffle=False,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.15" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/scripts/wide_deep_for_recsys/kaggle_wide_deep_model.py b/examples/scripts/wide_deep_for_recsys/kaggle_wide_deep_model.py new file mode 100644 index 0000000000000000000000000000000000000000..4036f00988def0db98e7f3b7f7c978b6c65e9e45 --- /dev/null +++ b/examples/scripts/wide_deep_for_recsys/kaggle_wide_deep_model.py @@ -0,0 +1,203 @@ +# This script is mostly a copy/paste from the Kaggle notebook +# https://www.kaggle.com/code/matanivanov/wide-deep-learning-for-recsys-with-pytorch. +# Is a response to the issue: +# https://github.com/jrzaurin/pytorch-widedeep/issues/133. 
+# In this script we run the exact same model used in that Kaggle notebook + +from pathlib import Path + +import numpy as np +import torch +import pandas as pd +from torch import nn, cat, mean +from scipy.sparse import coo_matrix + +device = "cuda" if torch.cuda.is_available() else "cpu" + +save_path = Path("prepared_data") + + +def get_coo_indexes(lil): +    rows = [] +    cols = [] +    for i, el in enumerate(lil): +        if not isinstance(el, list): +            el = [el] +        for j in el: +            rows.append(i) +            cols.append(j) +    return rows, cols + + +def get_sparse_features(series, shape): +    coo_indexes = get_coo_indexes(series.tolist()) +    sparse_df = coo_matrix( +        (np.ones(len(coo_indexes[0])), (coo_indexes[0], coo_indexes[1])), shape=shape +    ) +    return sparse_df + + +def sparse_to_idx(data, pad_idx=-1): +    indexes = data.nonzero() +    indexes_df = pd.DataFrame() +    indexes_df["rows"] = indexes[0] +    indexes_df["cols"] = indexes[1] +    mdf = indexes_df.groupby("rows").apply(lambda x: x["cols"].tolist()) +    max_len = mdf.apply(lambda x: len(x)).max() +    return mdf.apply(lambda x: pd.Series(x + [pad_idx] * (max_len - len(x)))).values + + +def idx_to_sparse(idx, sparse_dim): +    sparse = np.zeros(sparse_dim) +    sparse[int(idx)] = 1 +    return pd.Series(sparse, dtype=int) + + +def process_cats_as_kaggle_notebook(df): +    df["gender"] = (df["gender"] == "M").astype(int) +    df = pd.concat( +        [ +            df.drop("occupation", axis=1), +            pd.get_dummies(df["occupation"]).astype(int), +        ], +        axis=1, +    ) +    df.drop("other", axis=1, inplace=True) +    df.drop("zip_code", axis=1, inplace=True) + +    return df + + +id_cols = ["user_id", "movie_id"] + +df_train = pd.read_pickle(save_path / "df_train.pkl") +df_valid = pd.read_pickle(save_path / "df_valid.pkl") +df_test = pd.read_pickle(save_path / "df_test.pkl") +df_test = pd.concat([df_valid, df_test], ignore_index=True) + +df_train = process_cats_as_kaggle_notebook(df_train) +df_test = process_cats_as_kaggle_notebook(df_test) + +# Here is another caveat: we use the whole dataset to build 'train_movies_watched' +# when in reality one should use only the training set +max_movie_index = max(df_train.movie_id.max(), df_test.movie_id.max()) + +X_train = df_train.drop(id_cols + ["prev_movies", "target"], axis=1) +y_train = df_train.target.values +train_movies_watched = get_sparse_features( +    df_train["prev_movies"], (len(df_train), max_movie_index + 1) +) + +X_test = df_test.drop(id_cols + ["prev_movies", "target"], axis=1) +y_test = df_test.target.values +test_movies_watched = get_sparse_features( +    df_test["prev_movies"], (len(df_test), max_movie_index + 1) +) + +PAD_IDX = 0 + +X_train_tensor = torch.Tensor(X_train.fillna(0).values).to(device) +train_movies_watched_tensor = ( +    torch.sparse_coo_tensor( +        indices=train_movies_watched.nonzero(), +        values=[1] * len(train_movies_watched.nonzero()[0]), +        size=train_movies_watched.shape, +    ) +    .to_dense() +    .to(device) +) +movies_train_sequences = ( +    torch.Tensor( +        sparse_to_idx(train_movies_watched, pad_idx=PAD_IDX), +    ) +    .long() +    .to(device) +) +target_train = torch.Tensor(y_train).long().to(device) + + +X_test_tensor = torch.Tensor(X_test.fillna(0).values).to(device) +test_movies_watched_tensor = ( +    torch.sparse_coo_tensor( +        indices=test_movies_watched.nonzero(), +        values=[1] * len(test_movies_watched.nonzero()[0]), +        size=test_movies_watched.shape, +    ) +    .to_dense() +    .to(device) +) +movies_test_sequences = ( +    torch.Tensor( +        sparse_to_idx(test_movies_watched, pad_idx=PAD_IDX), +    ) +    .long() +    .to(device) +) +target_test = torch.Tensor(y_test).long().to(device) + + +class
WideAndDeep(nn.Module): +    def __init__( +        self, +        continuous_feature_shape,  # number of continuous features +        embed_size,  # size of embedding for binary features +        embed_dict_len,  # number of unique binary features +        pad_idx,  # padding index +    ): +        super(WideAndDeep, self).__init__() +        self.embed = nn.Embedding(embed_dict_len, embed_size, padding_idx=pad_idx) +        self.linear_relu_stack = nn.Sequential( +            nn.Linear(embed_size + continuous_feature_shape, 1024), +            nn.ReLU(), +            nn.Linear(1024, 512), +            nn.ReLU(), +            nn.Linear(512, 256), +            nn.ReLU(), +        ) +        self.head = nn.Sequential( +            nn.Linear(embed_dict_len + 256, embed_dict_len), +        ) + +    def forward(self, continuous, binary, binary_idx): +        # get embeddings for the sequence of indexes +        binary_embed = self.embed(binary_idx) +        binary_embed_mean = mean(binary_embed, dim=1) +        # get logits for the "deep" part: continuous features + binary embeddings +        deep_logits = self.linear_relu_stack( +            cat((continuous, binary_embed_mean), dim=1) +        ) +        # get the final logits from the "deep" part and the raw binary features +        total_logits = self.head(cat((deep_logits, binary), dim=1)) +        return total_logits + + +model = WideAndDeep(X_train.shape[1], 16, max_movie_index + 1, PAD_IDX).to(device) +print(model) + + +EPOCHS = 10 +loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX) +optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) + +for t in range(EPOCHS): +    model.train() +    pred_train = model( +        X_train_tensor, train_movies_watched_tensor, movies_train_sequences +    ) +    loss_train = loss_fn(pred_train, target_train) + +    # Backpropagation +    optimizer.zero_grad() +    loss_train.backward() +    optimizer.step() + +    model.eval() +    with torch.no_grad(): +        pred_test = model( +            X_test_tensor, test_movies_watched_tensor, movies_test_sequences +        ) +        loss_test = loss_fn(pred_test, target_test) + +    print(f"Epoch {t}") +    print(f"Train loss: {loss_train:>7f}") +    print(f"Test loss: {loss_test:>7f}") diff --git a/examples/scripts/wide_deep_for_recsys/ml100k_data_preparation.py b/examples/scripts/wide_deep_for_recsys/ml100k_data_preparation.py new file mode 100644 index 0000000000000000000000000000000000000000..f701ce193b1bb099ea9619bb71e192da61b927ca --- /dev/null +++ b/examples/scripts/wide_deep_for_recsys/ml100k_data_preparation.py @@ -0,0 +1,103 @@ +# This script is mostly a copy/paste from the Kaggle notebook +# https://www.kaggle.com/code/matanivanov/wide-deep-learning-for-recsys-with-pytorch.
+# It is a response to the issue: +# https://github.com/jrzaurin/pytorch-widedeep/issues/133. In this script we +# simply prepare the data that will later be used for a custom Wide and Deep +# model and for Wide and Deep models created using this library. +from pathlib import Path + +from sklearn.model_selection import train_test_split + +from pytorch_widedeep.datasets import load_movielens100k + +data, user, items = load_movielens100k(as_frame=True) + +# Alternatively, as specified in the docs: 'The last 19 fields are the genres' so: +# list_of_genres = items.columns.tolist()[-19:] +list_of_genres = [ +    "unknown", +    "Action", +    "Adventure", +    "Animation", +    "Children's", +    "Comedy", +    "Crime", +    "Documentary", +    "Drama", +    "Fantasy", +    "Film-Noir", +    "Horror", +    "Musical", +    "Mystery", +    "Romance", +    "Sci-Fi", +    "Thriller", +    "War", +    "Western", +] + + +# adding a column with the number of movies watched per user +dataset = data.sort_values(["user_id", "timestamp"]).reset_index(drop=True) +dataset["one"] = 1 +dataset["num_watched"] = dataset.groupby("user_id")["one"].cumsum() +dataset.drop("one", axis=1, inplace=True) + +# adding a column with the mean rating at a point in time per user +dataset["mean_rate"] = ( +    dataset.groupby("user_id")["rating"].cumsum() / dataset["num_watched"] +) + +# In this particular exercise the problem is formulated as predicting the +# next movie that will be watched (as a consequence, the last interaction of +# each user will be discarded) +dataset["target"] = dataset.groupby("user_id")["movie_id"].shift(-1) + +# Here the author builds the sequences +dataset["prev_movies"] = dataset["movie_id"].apply(lambda x: str(x)) +dataset["prev_movies"] = ( +    dataset.groupby("user_id")["prev_movies"] +    .apply(lambda x: (x + " ").cumsum().str.strip()) +    .reset_index(drop=True) +) +dataset["prev_movies"] = dataset["prev_movies"].apply(lambda x: x.split()) + +# Adding a genre_rate as the mean of all movies rated for a given genre per +# user +dataset = dataset.merge(items[["movie_id"] + list_of_genres], on="movie_id", how="left") +for genre in list_of_genres: +    dataset[f"{genre}_rate"] = dataset[genre] * dataset["rating"] +    dataset[genre] = dataset.groupby("user_id")[genre].cumsum() +    dataset[f"{genre}_rate"] = ( +        dataset.groupby("user_id")[f"{genre}_rate"].cumsum() / dataset[genre] +    ) +dataset[list_of_genres] = dataset[list_of_genres].apply( +    lambda x: x / dataset["num_watched"] +) + +# Again, we use the same settings as those in the Kaggle notebook, +# but 'COLD_START_THRESH' is pretty aggressive +COLD_START_THRESH = 5 + +filtered_data = dataset[ +    (dataset["num_watched"] >= COLD_START_THRESH) & ~(dataset["target"].isna()) +].sort_values("timestamp") +train_data, _test_data = train_test_split(filtered_data, test_size=0.2, shuffle=False) +valid_data, test_data = train_test_split(_test_data, test_size=0.5, shuffle=False) + +cols_to_drop = [ +    # "rating", +    "timestamp", +    "num_watched", +] + +df_train = train_data.drop(cols_to_drop, axis=1) +df_valid = valid_data.drop(cols_to_drop, axis=1) +df_test = test_data.drop(cols_to_drop, axis=1) + +save_path = Path("prepared_data") +if not save_path.exists(): +    save_path.mkdir(parents=True, exist_ok=True) + +df_train.to_pickle(save_path / "df_train.pkl") +df_valid.to_pickle(save_path / "df_valid.pkl") +df_test.to_pickle(save_path / "df_test.pkl") diff --git a/examples/scripts/wide_deep_for_recsys/pytorch_wide_deep_pt1.py b/examples/scripts/wide_deep_for_recsys/pytorch_wide_deep_pt1.py new file mode 100644 index
0000000000000000000000000000000000000000..4258d9a3c1f625bbeb2c080dfe4259671d7b62ac --- /dev/null +++ b/examples/scripts/wide_deep_for_recsys/pytorch_wide_deep_pt1.py @@ -0,0 +1,216 @@ +# In this script I illustrate how one could use our library to reproduce +# almost exactly the same model used in the Kaggle Notebook + +from pathlib import Path + +import numpy as np +import torch +import pandas as pd +from torch import nn +from scipy.sparse import coo_matrix + +from pytorch_widedeep import Trainer +from pytorch_widedeep.models import TabMlp, BasicRNN, WideDeep +from pytorch_widedeep.preprocessing import TabPreprocessor + +device = "cuda" if torch.cuda.is_available() else "cpu" + +save_path = Path("prepared_data") + +PAD_IDX = 0 + + +def get_coo_indexes(lil): +    rows = [] +    cols = [] +    for i, el in enumerate(lil): +        if not isinstance(el, list): +            el = [el] +        for j in el: +            rows.append(i) +            cols.append(j) +    return rows, cols + + +def get_sparse_features(series, shape): +    coo_indexes = get_coo_indexes(series.tolist()) +    sparse_df = coo_matrix( +        (np.ones(len(coo_indexes[0])), (coo_indexes[0], coo_indexes[1])), shape=shape +    ) +    return sparse_df + + +def sparse_to_idx(data, pad_idx=-1): +    indexes = data.nonzero() +    indexes_df = pd.DataFrame() +    indexes_df["rows"] = indexes[0] +    indexes_df["cols"] = indexes[1] +    mdf = indexes_df.groupby("rows").apply(lambda x: x["cols"].tolist()) +    max_len = mdf.apply(lambda x: len(x)).max() +    return mdf.apply(lambda x: pd.Series(x + [pad_idx] * (max_len - len(x)))).values + + +id_cols = ["user_id", "movie_id"] + +df_train = pd.read_pickle(save_path / "df_train.pkl") +df_valid = pd.read_pickle(save_path / "df_valid.pkl") +df_test = pd.read_pickle(save_path / "df_test.pkl") +df_test = pd.concat([df_valid, df_test], ignore_index=True) + +# Here is another caveat: we use the whole dataset to build 'train_movies_watched' +# when in reality one should use only the training set +max_movie_index = max(df_train.movie_id.max(), df_test.movie_id.max()) + +X_train = df_train.drop(id_cols + ["rating", "prev_movies", "target"], axis=1) +y_train = np.array(df_train.target.values, dtype="int64") +train_movies_watched = get_sparse_features( +    df_train["prev_movies"], (len(df_train), max_movie_index + 1) +) + +X_test = df_test.drop(id_cols + ["rating", "prev_movies", "target"], axis=1) +y_test = np.array(df_test.target.values, dtype="int64") +test_movies_watched = get_sparse_features( +    df_test["prev_movies"], (len(df_test), max_movie_index + 1) +) + +cat_cols = ["gender", "occupation", "zip_code"] +cont_cols = [c for c in X_train if c not in cat_cols] +tab_preprocessor = TabPreprocessor( +    cat_embed_cols=cat_cols, +    continuous_cols=cont_cols, +) + +# The sparse matrices need to be turned into dense ones, either at the array +# or at the tensor stage. This is one of the reasons why the wide component in +# our library is implemented as Embeddings. However, our implementation is +# still not suitable for the type of pre-processing that the author of the +# Kaggle notebook did to come up with what would be the wide component (a +# sparse matrix with 1s at those locations corresponding to the movies that a +# user has seen at a given point in time). Therefore, we will have to code a +# Wide model (fairly simple, since it is just a linear layer) +X_train_wide = np.array(train_movies_watched.todense()) +X_test_wide = np.array(test_movies_watched.todense()) + +# Here our tabular component is a bit more elaborate than the one in the +# notebook, just a bit...
+X_train_tab = tab_preprocessor.fit_transform(X_train.fillna(0)) +X_test_tab = tab_preprocessor.transform(X_test.fillna(0)) + +# The text component is the sequences of movies watched. There is an element +# of information redundancy here in my opinion. This is because the wide and +# text components implicitly carry the same information, just in a different +# form. Anyway, we want to reproduce the Kaggle notebook as closely as +# possible. +X_train_text = sparse_to_idx(train_movies_watched, pad_idx=PAD_IDX) +X_test_text = sparse_to_idx(test_movies_watched, pad_idx=PAD_IDX) + + +class Wide(nn.Module): +    def __init__(self, input_dim: int, pred_dim: int): +        super().__init__() + +        self.input_dim = input_dim +        self.pred_dim = pred_dim + +        # The way I coded the library I never thought that someone would ever +        # want to code their own wide component. However, if you do, the +        # wide component must have a 'wide_linear' attribute. In other words, +        # the linear layer must be called 'wide_linear' +        self.wide_linear = nn.Linear(input_dim, pred_dim) + +    def forward(self, X): +        out = self.wide_linear(X.type(torch.float32)) +        return out + + +wide = Wide(X_train_wide.shape[1], max_movie_index + 1) + + +class SimpleEmbed(nn.Module): +    def __init__(self, vocab_size: int, embed_dim: int, pad_idx: int): +        super().__init__() + +        self.vocab_size = vocab_size +        self.embed_dim = embed_dim +        self.pad_idx = pad_idx + +        # The sequences of movies watched are simply embedded in the Kaggle +        # notebook. No RNN, Transformer or any other model is used +        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx) + +    def forward(self, X): +        embed = self.embed(X) +        embed_mean = torch.mean(embed, dim=1) +        return embed_mean + +    @property +    def output_dim(self) -> int: +        return self.embed_dim + + +# In the notebook the author simply uses embeddings +simple_embed = SimpleEmbed(max_movie_index + 1, 16, 0) +# but maybe one would like to use an RNN to account for the sequential nature +# of the problem formulation +basic_rnn = BasicRNN( +    vocab_size=max_movie_index + 1, +    embed_dim=16, +    hidden_dim=32, +    n_layers=2, +    rnn_type="gru", +) + +tab_mlp = TabMlp( +    column_idx=tab_preprocessor.column_idx, +    cat_embed_input=tab_preprocessor.cat_embed_input, +    continuous_cols=tab_preprocessor.continuous_cols, +    cont_norm_layer=None, +    mlp_hidden_dims=[1024, 512, 256], +    mlp_activation="relu", +) + +# The main difference between this wide and deep model and the Wide and Deep +# model in the Kaggle notebook is that in that notebook, the author +# concatenates the embeddings and the tabular features (which he refers to +# as 'continuous'), then passes this concatenation through a stack of +# linear + ReLU layers. He then concatenates this output with the binary +# features and connects this concatenation with the final linear layer. Our +# implementation follows the notation of the original paper and instead of +# concatenating the tabular, text and wide components, we first compute their +# outputs, and then add them (see here: https://arxiv.org/pdf/1606.07792.pdf, +# their Eq 3). Note that this is effectively the same, with the caveat that +# while in one case we initialise one big weight matrix at once, in our +# implementation we initialise different matrices for different components. +# Anyway, let's give it a go.
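Since this "effectively the same" claim carries the comparison, it can be checked numerically. Below is a small standalone sketch (not part of the script) showing that one linear layer applied to a concatenation is identical to applying separate linear layers to each piece and summing, once the weights are the corresponding column blocks:

```python
import torch
from torch import nn

torch.manual_seed(0)

a, b = torch.randn(8, 3), torch.randn(8, 5)  # outputs of two 'components'

big = nn.Linear(8, 2, bias=False)  # one big weight matrix over the concatenation
w_a = nn.Linear(3, 2, bias=False)  # one matrix per component, outputs summed
w_b = nn.Linear(5, 2, bias=False)
with torch.no_grad():
    w_a.weight.copy_(big.weight[:, :3])  # column block acting on 'a'
    w_b.weight.copy_(big.weight[:, 3:])  # column block acting on 'b'

assert torch.allclose(big(torch.cat([a, b], dim=1)), w_a(a) + w_b(b), atol=1e-6)
```

The two parameterisations differ only in how the weight matrices are initialised and grouped, which is exactly the caveat noted in the comment above.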
+wide_deep_model = WideDeep( +    wide=wide, deeptabular=tab_mlp, deeptext=simple_embed, pred_dim=max_movie_index + 1 +) +# # To use an RNN, simply +# wide_deep_model = WideDeep( +#     wide=wide, deeptabular=tab_mlp, deeptext=basic_rnn, pred_dim=max_movie_index + 1 +# ) + +trainer = Trainer( +    model=wide_deep_model, +    objective="multiclass", +    custom_loss_function=nn.CrossEntropyLoss(ignore_index=PAD_IDX), +    optimizers=torch.optim.Adam(wide_deep_model.parameters(), lr=1e-3), +) + +trainer.fit( +    X_train={ +        "X_wide": X_train_wide, +        "X_tab": X_train_tab, +        "X_text": X_train_text, +        "target": y_train, +    }, +    X_val={ +        "X_wide": X_test_wide, +        "X_tab": X_test_tab, +        "X_text": X_test_text, +        "target": y_test, +    }, +    n_epochs=10, +    batch_size=512, +    shuffle=False, +) diff --git a/examples/scripts/wide_deep_for_recsys/pytorch_wide_deep_pt2.py b/examples/scripts/wide_deep_for_recsys/pytorch_wide_deep_pt2.py new file mode 100644 index 0000000000000000000000000000000000000000..053a7f0b050114cc9e91051a6ebb254d5ff4810a --- /dev/null +++ b/examples/scripts/wide_deep_for_recsys/pytorch_wide_deep_pt2.py @@ -0,0 +1,130 @@ +from pathlib import Path + +import numpy as np +import torch +import pandas as pd +from torch import nn + +from pytorch_widedeep import Trainer +from pytorch_widedeep.utils import pad_sequences +from pytorch_widedeep.models import TabMlp, WideDeep, Transformer +from pytorch_widedeep.preprocessing import TabPreprocessor + +save_path = Path("prepared_data") + +PAD_IDX = 0 + +id_cols = ["user_id", "movie_id"] + +df_train = pd.read_pickle(save_path / "df_train.pkl") +df_valid = pd.read_pickle(save_path / "df_valid.pkl") +df_test = pd.read_pickle(save_path / "df_test.pkl") +df_test = pd.concat([df_valid, df_test], ignore_index=True) + +# sequence length. Shorter sequences will be padded to this length. This is +# identical to the Kaggle implementation +maxlen = max( +    df_train.prev_movies.apply(lambda x: len(x)).max(), +    df_test.prev_movies.apply(lambda x: len(x)).max(), +) + +# Here there is a caveat. In principle, we are using (as in the Kaggle +# notebook) all indexes to compute the number of tokens in the dataset. To do +# this properly, one would have to use ONLY train tokens and add a token for +# new unknown/unseen movies in the test set.
This can also be done with this +# library or manually, so I will leave it to the reader to implement that +# tokenization approach +max_movie_index = max(df_train.movie_id.max(), df_test.movie_id.max()) + +# From now on things are pretty simple, all the more so bearing in mind that +# in this example we are not going to use a wide component since, in +# principle, I believe the information in that component is also 'carried' by +# the movie sequences (also, in the previous scripts one can see that most of +# the prediction power comes from the linear, wide model) +df_train_user_item = df_train[["user_id", "movie_id", "rating"]] +train_movies_sequences = df_train.prev_movies.apply( +    lambda x: [int(el) for el in x] +).to_list() +y_train = df_train.target.values.astype(int) + +df_test_user_item = df_test[["user_id", "movie_id", "rating"]] +test_movies_sequences = df_test.prev_movies.apply( +    lambda x: [int(el) for el in x] +).to_list() +y_test = df_test.target.values.astype(int) + +# As a tabular component we are simply going to encode the triplets +# (user, item, rating) +tab_preprocessor = TabPreprocessor( +    cat_embed_cols=["user_id", "movie_id", "rating"], +) +X_train_tab = tab_preprocessor.fit_transform(df_train_user_item) +X_test_tab = tab_preprocessor.transform(df_test_user_item) + +# And here we pad the sequences and define a transformer model for the text +# component, which is, in this case, the sequences of movies watched +X_train_text = np.array( +    [ +        pad_sequences( +            s, +            maxlen=maxlen, +            pad_first=False, +            pad_idx=PAD_IDX, +        ) +        for s in train_movies_sequences +    ] +) +X_test_text = np.array( +    [ +        pad_sequences( +            s, +            maxlen=maxlen, +            pad_first=False, +            pad_idx=PAD_IDX, +        ) +        for s in test_movies_sequences +    ] +) + +tab_mlp = TabMlp( +    column_idx=tab_preprocessor.column_idx, +    cat_embed_input=tab_preprocessor.cat_embed_input, +    mlp_hidden_dims=[1024, 512, 256], +    mlp_activation="relu", +) + +# plenty of options here, see the docs +transformer = Transformer( +    vocab_size=max_movie_index + 1, +    embed_dim=16, +    n_heads=2, +    n_blocks=2, +    seq_length=maxlen, +) + +wide_deep_model = WideDeep( +    deeptabular=tab_mlp, deeptext=transformer, pred_dim=max_movie_index + 1 +) + +trainer = Trainer( +    model=wide_deep_model, +    objective="multiclass", +    custom_loss_function=nn.CrossEntropyLoss(ignore_index=PAD_IDX), +    optimizers=torch.optim.Adam(wide_deep_model.parameters(), lr=1e-3), +) + +trainer.fit( +    X_train={ +        "X_tab": X_train_tab, +        "X_text": X_train_text, +        "target": y_train, +    }, +    X_val={ +        "X_tab": X_test_tab, +        "X_text": X_test_text, +        "target": y_test, +    }, +    n_epochs=10, +    batch_size=521, +    shuffle=False, +) diff --git a/pytorch_widedeep/datasets/__init__.py b/pytorch_widedeep/datasets/__init__.py index 9792d454b280897f7a86319c17bf2c816de09e7b..4c9b901616bc410b32fe7beffa3b0633b4d71f05 100644 --- a/pytorch_widedeep/datasets/__init__.py +++ b/pytorch_widedeep/datasets/__init__.py @@ -4,6 +4,7 @@ from ._base import ( load_birds, load_ecoli, load_bio_kdd04, + load_movielens100k, load_womens_ecommerce, load_california_housing, ) @@ -16,4 +17,5 @@ __all__ = [ "load_birds", "load_rf1", "load_womens_ecommerce", + "load_movielens100k", ] diff --git a/pytorch_widedeep/datasets/_base.py b/pytorch_widedeep/datasets/_base.py index 34e18cd17547e01ee1fa029072d0af26eb945459..bb4d410b231b781a812b38ea34f28dcb68b7abe3 100644 --- a/pytorch_widedeep/datasets/_base.py +++ b/pytorch_widedeep/datasets/_base.py @@ -1,12 +1,14 @@ # dataframes are saved as parquet, pyarrow, brotli # pd.to_parquet(path=None,
engine="auto", compression="brotli", index=False) # see related post: https://python.plainenglish.io/storing-pandas-98-faster-disk-reads-and-72-less-space-208e2e2be8bb +from typing import Tuple, Union from importlib import resources +import numpy as np import pandas as pd -def load_bio_kdd04(as_frame: bool = False): +def load_bio_kdd04(as_frame: bool = False) -> Union[np.ndarray, pd.DataFrame]: """Load and return the higly imbalanced binary classification Protein Homology Dataset from [KDD cup 2004](https://www.kdd.org/kdd-cup/view/kdd-cup-2004/Data). This datasets include only bio_train.dat part of the dataset @@ -39,7 +41,7 @@ def load_bio_kdd04(as_frame: bool = False): return df.to_numpy() -def load_adult(as_frame: bool = False): +def load_adult(as_frame: bool = False) -> Union[np.ndarray, pd.DataFrame]: """Load and return the higly imbalanced binary classification [adult income datatest](http://www.cs.toronto.edu/~delve/data/adult/desc.html). you may find detailed description [here](http://www.cs.toronto.edu/~delve/data/adult/adultDetail.html) """ @@ -55,7 +57,7 @@ def load_adult(as_frame: bool = False): return df.to_numpy() -def load_ecoli(as_frame: bool = False): +def load_ecoli(as_frame: bool = False) -> Union[np.ndarray, pd.DataFrame]: """Load and return the higly imbalanced multiclass classification e.coli dataset Dataset from [UCI Machine learning Repository](https://archive.ics.uci.edu/ml/datasets/ecoli). @@ -142,7 +144,7 @@ def load_ecoli(as_frame: bool = False): return df.to_numpy() -def load_california_housing(as_frame: bool = False): +def load_california_housing(as_frame: bool = False) -> Union[np.ndarray, pd.DataFrame]: """Load and return the higly imbalanced regression California housing dataset. Characteristics: @@ -190,7 +192,7 @@ def load_california_housing(as_frame: bool = False): return df.to_numpy() -def load_birds(as_frame: bool = False): +def load_birds(as_frame: bool = False) -> Union[np.ndarray, pd.DataFrame]: """Load and return the multi-label classification bird dataset. References @@ -216,7 +218,7 @@ def load_birds(as_frame: bool = False): return df.to_numpy() -def load_rf1(as_frame: bool = False): +def load_rf1(as_frame: bool = False) -> Union[np.ndarray, pd.DataFrame]: """Load and return the multi-target regression River Flow(RF1) dataset. Characterisctics: @@ -243,7 +245,7 @@ def load_rf1(as_frame: bool = False): return df.to_numpy() -def load_womens_ecommerce(as_frame: bool = False): +def load_womens_ecommerce(as_frame: bool = False) -> Union[np.ndarray, pd.DataFrame]: """ Context This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. @@ -279,3 +281,103 @@ def load_womens_ecommerce(as_frame: bool = False): return df else: return df.to_numpy() + + +def load_movielens100k( + as_frame: bool = False, +) -> Union[ + Tuple[np.ndarray, np.ndarray, np.ndarray], + Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame], +]: + """Load and return the MovieLens 100k dataset in 3 separate files. + + SUMMARY & USAGE LICENSE: + ============================================= + MovieLens data sets were collected by the GroupLens Research Project + at the University of Minnesota. + + This data set consists of: + * 100,000 ratings (1-5) from 943 users on 1682 movies. + * Each user has rated at least 20 movies. + * Simple demographic info for the users (age, gender, occupation, zip) + + The data was collected through the MovieLens web site + (movielens.umn.edu) during the seven-month period from September 19th, + 1997 through April 22nd, 1998. 
This data has been cleaned up - users +    who had less than 20 ratings or did not have complete demographic +    information were removed from this data set. Detailed descriptions of +    the data file can be found at the end of this file. + +    Neither the University of Minnesota nor any of the researchers +    involved can guarantee the correctness of the data, its suitability +    for any particular purpose, or the validity of results based on the +    use of the data set. The data set may be used for any research +    purposes under the following conditions: + +    * The user may not state or imply any endorsement from the +    University of Minnesota or the GroupLens Research Group. + +    * The user must acknowledge the use of the data set in +    publications resulting from the use of the data set +    (see below for citation information). + +    * The user may not redistribute the data without separate +    permission. + +    * The user may not use this information for any commercial or +    revenue-bearing purposes without first obtaining permission +    from a faculty member of the GroupLens Research Project at the +    University of Minnesota. + +    If you have any further questions or comments, please contact GroupLens +    . + +    CITATION: +    ============================================= +    To acknowledge use of the dataset in publications, please cite the +    following paper: + +    F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: +    History and Context. ACM Transactions on Interactive Intelligent +    Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. +    DOI=http://dx.doi.org/10.1145/2827872 + +    Returns +    ------- +    df_data: Union[np.ndarray, pd.DataFrame] +        The full u data set, 100000 ratings by 943 users on 1682 items. +        Each user has rated at least 20 movies. Users and items are +        numbered consecutively from 1. The data is randomly +        ordered. The time stamps are unix seconds since 1/1/1970 UTC +    df_users: Union[np.ndarray, pd.DataFrame] +        Demographic information about the users. +        The user ids are the ones used in the df_data data set. +    df_items: Union[np.ndarray, pd.DataFrame] +        Information about the items (movies). +        The last 19 fields are the genres, a 1 indicates the movie +        is of that genre, a 0 indicates it is not; movies can be in +        several genres at once. +        The movie ids are the ones used in the df_data data set.
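For reference, a minimal usage sketch for this loader; the return order matches the `Returns` section above (and the function body below), and the shapes are those asserted in the package tests:

```python
from pytorch_widedeep.datasets import load_movielens100k

# returned in the order (data, users, items)
df_data, df_users, df_items = load_movielens100k(as_frame=True)

print(df_data.shape)   # (100000, 4)
print(df_users.shape)  # (943, 5)
print(df_items.shape)  # (1682, 24)
```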
+ """ + with resources.path( + "pytorch_widedeep.datasets.data", + "MovieLens100k_data.parquet.brotli", + ) as fpath: + df_data = pd.read_parquet(fpath) + + with resources.path( + "pytorch_widedeep.datasets.data", + "MovieLens100k_items.parquet.brotli", + ) as fpath: + df_items = pd.read_parquet(fpath) + + with resources.path( + "pytorch_widedeep.datasets.data", + "MovieLens100k_users.parquet.brotli", + ) as fpath: + df_users = pd.read_parquet(fpath) + + if as_frame: + return df_data, df_users, df_items + else: + return df_data.to_numpy(), df_users.to_numpy(), df_items.to_numpy() diff --git a/pytorch_widedeep/datasets/data/MovieLens100k_data.parquet.brotli b/pytorch_widedeep/datasets/data/MovieLens100k_data.parquet.brotli new file mode 100644 index 0000000000000000000000000000000000000000..547834647a6e25b61bd511c55687c77599848746 Binary files /dev/null and b/pytorch_widedeep/datasets/data/MovieLens100k_data.parquet.brotli differ diff --git a/pytorch_widedeep/datasets/data/MovieLens100k_items.parquet.brotli b/pytorch_widedeep/datasets/data/MovieLens100k_items.parquet.brotli new file mode 100644 index 0000000000000000000000000000000000000000..5331eb5aa1c750f2eb8d83103d94d5b53626b369 Binary files /dev/null and b/pytorch_widedeep/datasets/data/MovieLens100k_items.parquet.brotli differ diff --git a/pytorch_widedeep/datasets/data/MovieLens100k_users.parquet.brotli b/pytorch_widedeep/datasets/data/MovieLens100k_users.parquet.brotli new file mode 100644 index 0000000000000000000000000000000000000000..c2d83d6c2b6c3697a5e02d369920a89fd9a85163 Binary files /dev/null and b/pytorch_widedeep/datasets/data/MovieLens100k_users.parquet.brotli differ diff --git a/pytorch_widedeep/models/__init__.py b/pytorch_widedeep/models/__init__.py index 9d989ac598b0f4fbe33ff70c6f78dd93c1cfedd7..b5d272d105f393b4285389d1ed4d906e950e6d29 100644 --- a/pytorch_widedeep/models/__init__.py +++ b/pytorch_widedeep/models/__init__.py @@ -1,5 +1,6 @@ from pytorch_widedeep.models.text import ( BasicRNN, + Transformer, AttentiveRNN, StackedAttentiveRNN, ) diff --git a/pytorch_widedeep/models/tabular/transformers/_attention_layers.py b/pytorch_widedeep/models/tabular/transformers/_attention_layers.py index 3ca1452c2c1a653acef25f6118677c79d1a2c6ee..65004bcd0a4d7005ff3c6324640d83c4884eb550 100644 --- a/pytorch_widedeep/models/tabular/transformers/_attention_layers.py +++ b/pytorch_widedeep/models/tabular/transformers/_attention_layers.py @@ -22,16 +22,20 @@ class FeedForward(nn.Module): self, input_dim: int, dropout: float, + mult: float, activation: str, - mult: float = 4.0, + *, + ff_hidden_dim: Optional[int] = None, ): super(FeedForward, self).__init__() - ff_hidden_dim = int(input_dim * mult) + ff_hid_dim = ( + ff_hidden_dim if ff_hidden_dim is not None else int(input_dim * mult) + ) self.w_1 = nn.Linear( input_dim, - ff_hidden_dim * 2 if activation.endswith("glu") else ff_hidden_dim, + ff_hid_dim * 2 if activation.endswith("glu") else ff_hid_dim, ) - self.w_2 = nn.Linear(ff_hidden_dim, input_dim) + self.w_2 = nn.Linear(ff_hid_dim, input_dim) self.dropout = nn.Dropout(dropout) self.activation = get_activation_fn(activation) diff --git a/pytorch_widedeep/models/tabular/transformers/_encoders.py b/pytorch_widedeep/models/tabular/transformers/_encoders.py index f41c793e3ac38f60373d5084935b637fc805a152..64e5a94125a4dc24e7c5e241589d405e6518f9d2 100644 --- a/pytorch_widedeep/models/tabular/transformers/_encoders.py +++ b/pytorch_widedeep/models/tabular/transformers/_encoders.py @@ -20,6 +20,7 @@ class TransformerEncoder(nn.Module): 
use_bias: bool, attn_dropout: float, ff_dropout: float, + ff_factor: int, activation: str, ): super(TransformerEncoder, self).__init__() @@ -30,7 +31,7 @@ class TransformerEncoder(nn.Module): use_bias, attn_dropout, ) - self.ff = FeedForward(input_dim, ff_dropout, activation) + self.ff = FeedForward(input_dim, ff_dropout, ff_factor, activation) self.attn_addnorm = AddNorm(input_dim, attn_dropout) self.ff_addnorm = AddNorm(input_dim, ff_dropout) @@ -48,6 +49,7 @@ class SaintEncoder(nn.Module): use_bias: bool, attn_dropout: float, ff_dropout: float, + ff_factor: int, activation: str, n_feat: int, ): @@ -61,7 +63,7 @@ class SaintEncoder(nn.Module): use_bias, attn_dropout, ) - self.col_attn_ff = FeedForward(input_dim, ff_dropout, activation) + self.col_attn_ff = FeedForward(input_dim, ff_dropout, ff_factor, activation) self.col_attn_addnorm = AddNorm(input_dim, attn_dropout) self.col_attn_ff_addnorm = AddNorm(input_dim, ff_dropout) @@ -71,7 +73,12 @@ class SaintEncoder(nn.Module): use_bias, attn_dropout, ) - self.row_attn_ff = FeedForward(n_feat * input_dim, ff_dropout, activation) + self.row_attn_ff = FeedForward( + n_feat * input_dim, + ff_dropout, + ff_factor, + activation, + ) self.row_attn_addnorm = AddNorm(n_feat * input_dim, attn_dropout) self.row_attn_ff_addnorm = AddNorm(n_feat * input_dim, ff_dropout) @@ -94,10 +101,10 @@ class FTTransformerEncoder(nn.Module): use_bias: bool, attn_dropout: float, ff_dropout: float, + ff_factor: float, kv_compression_factor: float, kv_sharing: bool, activation: str, - ff_factor: float, first_block: bool, ): super(FTTransformerEncoder, self).__init__() @@ -113,7 +120,7 @@ class FTTransformerEncoder(nn.Module): kv_compression_factor, kv_sharing, ) - self.ff = FeedForward(input_dim, ff_dropout, activation, ff_factor) + self.ff = FeedForward(input_dim, ff_dropout, ff_factor, activation) self.attn_normadd = NormAdd(input_dim, attn_dropout) self.ff_normadd = NormAdd(input_dim, ff_dropout) @@ -134,6 +141,7 @@ class PerceiverEncoder(nn.Module): use_bias: bool, attn_dropout: float, ff_dropout: float, + ff_factor: int, activation: str, query_dim: Optional[int] = None, ): @@ -147,7 +155,7 @@ class PerceiverEncoder(nn.Module): query_dim, ) attn_dim_out = query_dim if query_dim is not None else input_dim - self.ff = FeedForward(attn_dim_out, ff_dropout, activation) + self.ff = FeedForward(attn_dim_out, ff_dropout, ff_factor, activation) self.ln_q = nn.LayerNorm(attn_dim_out) self.ln_kv = nn.LayerNorm(input_dim) @@ -171,6 +179,7 @@ class FastFormerEncoder(nn.Module): use_bias: bool, attn_dropout: float, ff_dropout: float, + ff_factor: int, share_qv_weights: bool, activation: str, ): @@ -184,7 +193,7 @@ class FastFormerEncoder(nn.Module): share_qv_weights, ) - self.ff = FeedForward(input_dim, ff_dropout, activation) + self.ff = FeedForward(input_dim, ff_dropout, ff_factor, activation) self.attn_addnorm = AddNorm(input_dim, attn_dropout) self.ff_addnorm = AddNorm(input_dim, ff_dropout) diff --git a/pytorch_widedeep/models/tabular/transformers/ft_transformer.py b/pytorch_widedeep/models/tabular/transformers/ft_transformer.py index 7b1589a628fef189f0878e8f6711168fabecdfbc..50cf0c2d7be50853fbda8a4de3bbdd6ce68910a2 100644 --- a/pytorch_widedeep/models/tabular/transformers/ft_transformer.py +++ b/pytorch_widedeep/models/tabular/transformers/ft_transformer.py @@ -90,13 +90,13 @@ class FTTransformer(BaseTabularModelWithAttention): Dropout that will be applied to the Linear-Attention layers ff_dropout: float, default = 0.1 Dropout that will be applied to the FeedForward 
network - transformer_activation: str, default = "gelu" - Transformer Encoder activation function. _'tanh'_, _'relu'_, - _'leaky_relu'_, _'gelu'_, _'geglu'_ and _'reglu'_ are supported ff_factor: float, default = 4 / 3 Multiplicative factor applied to the first layer of the FF network in each Transformer block, This is normally set to 4, but they use 4/3 in the paper. + transformer_activation: str, default = "gelu" + Transformer Encoder activation function. _'tanh'_, _'relu'_, + _'leaky_relu'_, _'gelu'_, _'geglu'_ and _'reglu'_ are supported mlp_hidden_dims: List, Optional, default = None MLP hidden dimensions. If not provided no MLP on top of the final FTTransformer block will be used @@ -162,8 +162,8 @@ class FTTransformer(BaseTabularModelWithAttention): n_blocks: int = 4, attn_dropout: float = 0.2, ff_dropout: float = 0.1, - transformer_activation: str = "reglu", ff_factor: float = 1.33, + transformer_activation: str = "reglu", mlp_hidden_dims: Optional[List[int]] = None, mlp_activation: str = "relu", mlp_dropout: float = 0.1, @@ -197,8 +197,8 @@ class FTTransformer(BaseTabularModelWithAttention): self.n_blocks = n_blocks self.attn_dropout = attn_dropout self.ff_dropout = ff_dropout - self.transformer_activation = transformer_activation self.ff_factor = ff_factor + self.transformer_activation = transformer_activation self.mlp_hidden_dims = mlp_hidden_dims self.mlp_activation = mlp_activation @@ -226,10 +226,10 @@ class FTTransformer(BaseTabularModelWithAttention): use_qkv_bias, attn_dropout, ff_dropout, + ff_factor, kv_compression_factor, kv_sharing, transformer_activation, - ff_factor, is_first, ), ) diff --git a/pytorch_widedeep/models/tabular/transformers/saint.py b/pytorch_widedeep/models/tabular/transformers/saint.py index cfed7488348c0f5021c8834aa1a1151bc4e0fbca..eade6550198785bff3ba15aa1558ec3967cfc932 100644 --- a/pytorch_widedeep/models/tabular/transformers/saint.py +++ b/pytorch_widedeep/models/tabular/transformers/saint.py @@ -80,6 +80,9 @@ class SAINT(BaseTabularModelWithAttention): row layers ff_dropout: float, default = 0.1 Dropout that will be applied to the FeedForward network + ff_factor: float, default = 4 + Multiplicative factor applied to the first layer of the FF network in + each Transformer block. This is normally set to 4. transformer_activation: str, default = "gelu" Transformer Encoder activation function.
_'tanh'_, _'relu'_, + _'leaky_relu'_, _'gelu'_, _'geglu'_ and _'reglu'_ are supported @@ -146,6 +149,7 @@ class SAINT(BaseTabularModelWithAttention): n_blocks: int = 2, attn_dropout: float = 0.1, ff_dropout: float = 0.2, + ff_factor: int = 4, transformer_activation: str = "gelu", mlp_hidden_dims: Optional[List[int]] = None, mlp_activation: str = "relu", @@ -178,6 +182,7 @@ class SAINT(BaseTabularModelWithAttention): self.n_blocks = n_blocks self.attn_dropout = attn_dropout self.ff_dropout = ff_dropout + self.ff_factor = ff_factor self.transformer_activation = transformer_activation self.mlp_hidden_dims = mlp_hidden_dims @@ -204,6 +209,7 @@ class SAINT(BaseTabularModelWithAttention): use_qkv_bias, attn_dropout, ff_dropout, + ff_factor, transformer_activation, self.n_feats, ), diff --git a/pytorch_widedeep/models/tabular/transformers/tab_fastformer.py b/pytorch_widedeep/models/tabular/transformers/tab_fastformer.py index 17e9114b566c9951156703fc0e6d4ab1a6a27059..bf61d6078d6b8fc5b070860323d3c504f06501cf 100644 --- a/pytorch_widedeep/models/tabular/transformers/tab_fastformer.py +++ b/pytorch_widedeep/models/tabular/transformers/tab_fastformer.py @@ -84,6 +84,9 @@ class TabFastFormer(BaseTabularModelWithAttention): Dropout that will be applied to the Additive Attention layers ff_dropout: float, default = 0.1 Dropout that will be applied to the FeedForward network + ff_factor: float, default = 4 + Multiplicative factor applied to the first layer of the FF network in + each Transformer block. This is normally set to 4. share_qv_weights: bool, default = False Following the paper, this is a boolean indicating if the Value ($V$) and the Query ($Q$) transformation parameters will be shared. @@ -159,6 +162,7 @@ class TabFastFormer(BaseTabularModelWithAttention): n_blocks: int = 4, attn_dropout: float = 0.1, ff_dropout: float = 0.2, + ff_factor: int = 4, share_qv_weights: bool = False, share_weights: bool = False, transformer_activation: str = "relu", @@ -193,6 +197,7 @@ class TabFastFormer(BaseTabularModelWithAttention): self.n_blocks = n_blocks self.attn_dropout = attn_dropout self.ff_dropout = ff_dropout + self.ff_factor = ff_factor self.share_qv_weights = share_qv_weights self.share_weights = share_weights self.transformer_activation = transformer_activation @@ -218,6 +223,7 @@ class TabFastFormer(BaseTabularModelWithAttention): use_bias, attn_dropout, ff_dropout, + ff_factor, share_qv_weights, transformer_activation, ) @@ -236,6 +242,7 @@ class TabFastFormer(BaseTabularModelWithAttention): use_bias, attn_dropout, ff_dropout, + ff_factor, share_qv_weights, transformer_activation, ), diff --git a/pytorch_widedeep/models/tabular/transformers/tab_perceiver.py b/pytorch_widedeep/models/tabular/transformers/tab_perceiver.py index 53573aa9135a1df01666b4db0eb492b16d6a79c5..6b159760534be7a4badf9e9e5f3cd3047c6b0b04 100644 --- a/pytorch_widedeep/models/tabular/transformers/tab_perceiver.py +++ b/pytorch_widedeep/models/tabular/transformers/tab_perceiver.py @@ -108,6 +108,9 @@ class TabPerceiver(BaseTabularModelWithAttention): Dropout that will be applied to the Multi-Head Attention layers ff_dropout: float, default = 0.1 Dropout that will be applied to the FeedForward network + ff_factor: float, default = 4 + Multiplicative factor applied to the first layer of the FF network in + each Transformer block. This is normally set to 4. transformer_activation: str, default = "gelu" Transformer Encoder activation function.
_'tanh'_, _'relu'_, + _'leaky_relu'_, _'gelu'_, _'geglu'_ and _'reglu'_ are supported @@ -183,6 +186,7 @@ class TabPerceiver(BaseTabularModelWithAttention): share_weights: bool = False, attn_dropout: float = 0.1, ff_dropout: float = 0.1, + ff_factor: int = 4, transformer_activation: str = "geglu", mlp_hidden_dims: Optional[List[int]] = None, mlp_activation: str = "relu", @@ -220,6 +224,7 @@ class TabPerceiver(BaseTabularModelWithAttention): self.share_weights = share_weights self.attn_dropout = attn_dropout self.ff_dropout = ff_dropout + self.ff_factor = ff_factor self.transformer_activation = transformer_activation self.mlp_hidden_dims = mlp_hidden_dims @@ -343,6 +348,7 @@ class TabPerceiver(BaseTabularModelWithAttention): False, # use_bias self.attn_dropout, self.ff_dropout, + self.ff_factor, self.transformer_activation, self.latent_dim, # q_dim, ), @@ -360,6 +366,7 @@ class TabPerceiver(BaseTabularModelWithAttention): False, # use_bias self.attn_dropout, self.ff_dropout, + self.ff_factor, self.transformer_activation, ), ) diff --git a/pytorch_widedeep/models/tabular/transformers/tab_transformer.py b/pytorch_widedeep/models/tabular/transformers/tab_transformer.py index 868e3cbf0c86f4021053a2311777a4b1fd658c83..20211ab06b0a44a1ff7f86b3705a7e8a8c8e318a 100644 --- a/pytorch_widedeep/models/tabular/transformers/tab_transformer.py +++ b/pytorch_widedeep/models/tabular/transformers/tab_transformer.py @@ -86,6 +86,9 @@ class TabTransformer(BaseTabularModelWithAttention): Dropout that will be applied to the Multi-Head Attention layers ff_dropout: float, default = 0.1 Dropout that will be applied to the FeedForward network + ff_factor: float, default = 4 + Multiplicative factor applied to the first layer of the FF network in + each Transformer block. This is normally set to 4. transformer_activation: str, default = "gelu" Transformer Encoder activation function.
_'tanh'_, _'relu'_, + _'leaky_relu'_, _'gelu'_, _'geglu'_ and _'reglu'_ are supported @@ -153,6 +156,7 @@ class TabTransformer(BaseTabularModelWithAttention): n_blocks: int = 4, attn_dropout: float = 0.2, ff_dropout: float = 0.1, + ff_factor: int = 4, transformer_activation: str = "gelu", mlp_hidden_dims: Optional[List[int]] = None, mlp_activation: str = "relu", @@ -186,6 +190,7 @@ class TabTransformer(BaseTabularModelWithAttention): self.attn_dropout = attn_dropout self.ff_dropout = ff_dropout self.transformer_activation = transformer_activation + self.ff_factor = ff_factor self.mlp_hidden_dims = mlp_hidden_dims self.mlp_activation = mlp_activation @@ -215,6 +220,7 @@ class TabTransformer(BaseTabularModelWithAttention): use_qkv_bias, attn_dropout, ff_dropout, + ff_factor, transformer_activation, ), ) diff --git a/pytorch_widedeep/models/text/__init__.py b/pytorch_widedeep/models/text/__init__.py index 4dec7578dd8fff208ff450032f45b1fc8744e0f4..4a9afc0660021b2e4e3d119c2150b1914f72b485 100644 --- a/pytorch_widedeep/models/text/__init__.py +++ b/pytorch_widedeep/models/text/__init__.py @@ -1,5 +1,6 @@ from pytorch_widedeep.models.text.basic_rnn import BasicRNN from pytorch_widedeep.models.text.attentive_rnn import AttentiveRNN +from pytorch_widedeep.models.text.basic_transformer import Transformer from pytorch_widedeep.models.text.stacked_attentive_rnn import ( StackedAttentiveRNN, ) diff --git a/pytorch_widedeep/models/text/basic_transformer.py b/pytorch_widedeep/models/text/basic_transformer.py new file mode 100644 index 0000000000000000000000000000000000000000..7591ac65efc99c323385bae74ce82d39a9883e8c --- /dev/null +++ b/pytorch_widedeep/models/text/basic_transformer.py @@ -0,0 +1,191 @@ +import math + +import torch +from torch import nn + +from pytorch_widedeep.wdtypes import Union, Tensor, Optional +from pytorch_widedeep.utils.general_utils import Alias +from pytorch_widedeep.models.tabular.transformers._encoders import ( + TransformerEncoder, +) + + +class Transformer(nn.Module): + r"""Basic Encoder-Only Transformer Model for text classification/regression. + Like all other models in the library, this model can be used as the + `deeptext` component of a Wide & Deep model or independently by itself. + + **NOTE**: This model is introduced in the context of recommendation + systems and intended for sequences of any nature (e.g. items). It can, + of course, still be used for text. However, at this stage, we have + decided to not include the possibility of loading pretrained word + vectors since we aim to integrate the library with Hugging Face in the + (hopefully) near future. + + Parameters + ---------- + vocab_size: int + Number of words in the vocabulary + input_dim: int + Dimension of the token embeddings + + Param aliases: `embed_dim`, `d_model`.
+ + seq_length: int + Input sequence length + + Param aliases: `max_length`, `maxlen`. + + n_heads: int + Number of attention heads per Transformer block + n_blocks: int + Number of Transformer blocks + attn_dropout: float, default = 0.1 + Dropout that will be applied to the Multi-Head Attention layers + ff_dropout: float, default = 0.1 + Dropout that will be applied to the FeedForward network + ff_factor: float, default = 4 + Multiplicative factor applied to the first layer of the FF network in + each Transformer block. This is normally set to 4. + activation: str, default = "gelu" + Transformer Encoder activation function. _'tanh'_, _'relu'_, + _'leaky_relu'_, _'gelu'_, _'geglu'_ and _'reglu'_ are supported + with_cls_token: bool, default = False + Boolean indicating if a `'[CLS]'` token is included in the tokenized + sequences. If present, the final hidden state corresponding to this + token is used as the aggregated representation for classification and + regression tasks. **NOTE**: if included in the tokenized sequences it + must be inserted as the first token in the sequences. + with_pos_encoding: bool, default = True + Boolean indicating if positional encoding will be used + pos_encoding_dropout: float, default = 0.1 + Positional encoding dropout + pos_encoder: nn.Module, Optional, default = None + This model uses by default a standard positional encoding approach. + However, any custom positional encoder can also be used and passed to + the Transformer model via the 'pos_encoder' parameter + + Attributes + ---------- + embedding: nn.Module + Standard token embedding layer + pos_encoder: nn.Module + Positional Encoder + encoder: nn.Module + Sequence of Transformer blocks + """ + + @Alias("input_dim", ["embed_dim", "d_model"]) + @Alias("seq_length", ["max_length", "maxlen"]) + def __init__( + self, + vocab_size: int, + seq_length: int, + input_dim: int, + n_heads: int, + n_blocks: int, + attn_dropout: float = 0.1, + ff_dropout: float = 0.1, + ff_factor: int = 4, + activation: str = "gelu", + with_cls_token: bool = False, + *, # from here on pos encoding args + with_pos_encoding: bool = True, + pos_encoding_dropout: float = 0.1, + pos_encoder: Optional[nn.Module] = None, + ): + super().__init__() + + self.input_dim = input_dim + self.seq_length = seq_length + self.n_heads = n_heads + self.n_blocks = n_blocks + self.attn_dropout = attn_dropout + self.ff_dropout = ff_dropout + self.ff_factor = ff_factor + self.activation = activation + self.with_cls_token = with_cls_token + self.with_pos_encoding = with_pos_encoding + self.pos_encoding_dropout = pos_encoding_dropout + + self.embedding = nn.Embedding(vocab_size, input_dim) + + if with_pos_encoding: + if pos_encoder is not None: + self.pos_encoder: Union[ + nn.Module, nn.Identity, PositionalEncoding + ] = pos_encoder + else: + self.pos_encoder = PositionalEncoding( + input_dim, pos_encoding_dropout, seq_length + ) + else: + self.pos_encoder = nn.Identity() + + self.encoder = nn.Sequential() + for i in range(n_blocks): + self.encoder.add_module( + "transformer_block" + str(i), + TransformerEncoder( + input_dim, + n_heads, + False, # use_qkv_bias + attn_dropout, + ff_dropout, + ff_factor, + activation, + ), + ) + + def forward(self, X: Tensor) -> Tensor: + x = self.embedding(X) + x = self.pos_encoder(x) + x = self.encoder(x) + if self.with_cls_token: + x = x[:, 0, :] + else: + x = x.flatten(1) + return x + + @property + def output_dim(self) -> int: + if self.with_cls_token: + output_dim = self.input_dim + else: + output_dim =
self.input_dim * self.seq_length + return output_dim + + +class PositionalEncoding(nn.Module): + """Positional Encoding copied and pasted directly from [The Beginners' + Tutorial] + (https://pytorch.org/tutorials/beginner/transformer_tutorial.html) at the + Pytorch site. Here it is simply adapted so that the input sequence length + must be specified, and in our implementation the input tensor dimensions + are arranged as `[batch_size, seq_len, embedding_dim]` instead of + `[seq_len, batch_size, embedding_dim]`, as in the aforementioned + tutorial + + Parameters + ---------- + input_dim: int + Dimension of the token embeddings + dropout: float + Positional encoding dropout + seq_length: int + Input sequence length + + """ + + def __init__(self, input_dim: int, dropout: float, seq_length: int): + super().__init__() + self.dropout = nn.Dropout(p=dropout) + + position = torch.arange(seq_length).unsqueeze(1) + div_term = torch.exp( + torch.arange(0, input_dim, 2) * (-math.log(10000.0) / input_dim) + ) + pe = torch.zeros(1, seq_length, input_dim) + pe[0, :, 0::2] = torch.sin(position * div_term) + pe[0, :, 1::2] = torch.cos(position * div_term) + self.register_buffer("pe", pe) + + def forward(self, X: Tensor) -> Tensor: + return self.dropout(X + self.pe) diff --git a/pytorch_widedeep/preprocessing/text_preprocessor.py b/pytorch_widedeep/preprocessing/text_preprocessor.py index f713ba84864b89ebc032d5866714dda6b014e1cf..364e6faae5528363303e1aa8b3ffa36a085fd7a4 100644 --- a/pytorch_widedeep/preprocessing/text_preprocessor.py +++ b/pytorch_widedeep/preprocessing/text_preprocessor.py @@ -9,6 +9,7 @@ from pytorch_widedeep.utils.text_utils import ( pad_sequences, build_embeddings_matrix, ) +from pytorch_widedeep.utils.general_utils import Alias from pytorch_widedeep.utils.fastai_transforms import Vocab from pytorch_widedeep.preprocessing.base_preprocessor import ( BasePreprocessor, @@ -34,6 +35,13 @@ class TextPreprocessor(BasePreprocessor): end of the sequences pad_idx: int, default = 1 padding index. Fastai's Tokenizer leaves 0 for the 'unknown' token. + already_processed: bool, Optional, default = False + Boolean indicating if the sequence of elements is already processed or + prepared. If this is the case, this Preprocessor will simply tokenize + and pad the sequence. + + Param aliases: `not_text`.
diff --git a/pytorch_widedeep/preprocessing/text_preprocessor.py b/pytorch_widedeep/preprocessing/text_preprocessor.py
index f713ba84864b89ebc032d5866714dda6b014e1cf..364e6faae5528363303e1aa8b3ffa36a085fd7a4 100644
--- a/pytorch_widedeep/preprocessing/text_preprocessor.py
+++ b/pytorch_widedeep/preprocessing/text_preprocessor.py
@@ -9,6 +9,7 @@ from pytorch_widedeep.utils.text_utils import (
     pad_sequences,
     build_embeddings_matrix,
 )
+from pytorch_widedeep.utils.general_utils import Alias
 from pytorch_widedeep.utils.fastai_transforms import Vocab
 from pytorch_widedeep.preprocessing.base_preprocessor import (
     BasePreprocessor,
@@ -34,6 +35,13 @@ class TextPreprocessor(BasePreprocessor):
         end of the sequences
     pad_idx: int, default = 1
         padding index. Fastai's Tokenizer leaves 0 for the 'unknown' token.
+    already_processed: bool, Optional, default = False
+        Boolean indicating if the input sequences are already processed or
+        prepared. If `True`, this Preprocessor will simply tokenize and pad
+        the sequences.
+
+        Param aliases: `not_text`.
+
     word_vectors_path: str, Optional
         Path to the pretrained word vectors
     n_cpus: int, Optional, default = None
@@ -66,6 +74,7 @@ class TextPreprocessor(BasePreprocessor):
     array([[ 1,  1,  9, 16, 17, 18, 11,  0,  0, 13]], dtype=int32)
     """
 
+    @Alias("already_processed", "not_text")
     def __init__(
         self,
         text_col: str,
@@ -74,6 +83,7 @@ class TextPreprocessor(BasePreprocessor):
         maxlen: int = 80,
         pad_first: bool = True,
         pad_idx: int = 1,
+        already_processed: Optional[bool] = False,
         word_vectors_path: Optional[str] = None,
         n_cpus: Optional[int] = None,
         verbose: int = 1,
@@ -86,6 +96,7 @@ class TextPreprocessor(BasePreprocessor):
         self.maxlen = maxlen
         self.pad_first = pad_first
         self.pad_idx = pad_idx
+        self.already_processed = already_processed
         self.word_vectors_path = word_vectors_path
         self.verbose = verbose
         self.n_cpus = n_cpus if n_cpus is not None else os.cpu_count()
@@ -104,9 +115,12 @@ class TextPreprocessor(BasePreprocessor):
             `TextPreprocessor` fitted object
         """
         texts = df[self.text_col].tolist()
-        tokens = get_texts(texts, self.n_cpus)
+        tokens = get_texts(texts, self.already_processed, self.n_cpus)
         self.vocab = Vocab.create(
-            tokens, max_vocab=self.max_vocab, min_freq=self.min_freq
+            tokens,
+            max_vocab=self.max_vocab,
+            min_freq=self.min_freq,
+            pad_idx=self.pad_idx,
         )
         if self.verbose:
             print("The vocabulary contains {} tokens".format(len(self.vocab.stoi)))
@@ -131,7 +145,7 @@ class TextPreprocessor(BasePreprocessor):
         """
         check_is_fitted(self, attributes=["vocab"])
         texts = df[self.text_col].tolist()
-        self.tokens = get_texts(texts, self.n_cpus)
+        self.tokens = get_texts(texts, self.already_processed, self.n_cpus)
         sequences = [self.vocab.numericalize(t) for t in self.tokens]
         padded_seq = np.array(
             [
diff --git a/pytorch_widedeep/training/trainer.py b/pytorch_widedeep/training/trainer.py
index 5dd45ae3e77e4fcb34c73410c0e3d1d79ddaf0ae..6508b88ac4a400d900514102b16271f9a136b636 100644
--- a/pytorch_widedeep/training/trainer.py
+++ b/pytorch_widedeep/training/trainer.py
@@ -1127,6 +1127,7 @@ class Trainer(BaseTrainer):
     @staticmethod
     def _extract_kwargs(kwargs):
        dataloader_params = [
+            "shuffle",
             "sampler",
             "batch_sampler",
             "num_workers",
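+
+# A minimal sketch of what the new "shuffle" entry enables (hypothetical
+# call, assuming the extra kwargs passed to `fit` are the ones extracted
+# here and forwarded to the training DataLoader):
+#
+#     trainer = Trainer(model, objective="binary")
+#     trainer.fit(X_tab=X_tab, target=target, n_epochs=1, batch_size=32,
+#                 shuffle=False)  # now reaches torch.utils.data.DataLoader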
diff --git a/pytorch_widedeep/utils/fastai_transforms.py b/pytorch_widedeep/utils/fastai_transforms.py
index dc128e6ba73529b33c84c25f354f2b94ce4800e1..087df37a94f2224644f15882691556f193c836cf 100644
--- a/pytorch_widedeep/utils/fastai_transforms.py
+++ b/pytorch_widedeep/utils/fastai_transforms.py
@@ -338,6 +338,7 @@ class Tokenizer:
     )
 
 
+# TODO: Fix bug regarding token num 0
 class Vocab:
     r"""Contains the correspondence between numbers and tokens.
 
@@ -390,7 +391,13 @@ class Vocab:
         pickle.dump(self.itos, open(path, "wb"))
 
     @classmethod
-    def create(cls, tokens: Tokens, max_vocab: int, min_freq: int) -> "Vocab":
+    def create(
+        cls,
+        tokens: Tokens,
+        max_vocab: int,
+        min_freq: int,
+        pad_idx: Optional[int] = None,
+    ) -> "Vocab":
         r"""Create a vocabulary object from a set of tokens.
 
         Parameters
@@ -401,9 +408,12 @@ class Vocab:
             strings (e.g. list of tokenized sentences)
         max_vocab: int
             maximum vocabulary size
         min_freq: int
            minimum frequency that a token has to appear to be part of the
            vocabulary
+        pad_idx: int, Optional, default = None
+            padding index. If `None`, the padding index defaults to 1
+            (Fastai's `Tokenizer` reserves 0 for the 'unknown' token)
 
         Examples
         --------
@@ -426,12 +433,19 @@
         Vocab
             An instance of a `Vocab` object
         """
+
         freq = Counter(p for o in tokens for p in o)
         itos = [o for o, c in freq.most_common(max_vocab) if c >= min_freq]
         for o in reversed(defaults.text_spec_tok):
             if o in itos:
                 itos.remove(o)
             itos.insert(0, o)
+
+        if pad_idx is not None:
+            # move the padding token from its default position to 'pad_idx'
+            itos.remove(PAD)
+            itos.insert(pad_idx, PAD)
+
         itos = itos[:max_vocab]
         if (
             len(itos) < max_vocab
diff --git a/pytorch_widedeep/utils/text_utils.py b/pytorch_widedeep/utils/text_utils.py
index 06ae5b5a3efd6ea4ec8c9e76e7caeaa7f7776613..ab588aee46d7251c10389bf4e5ea99cc4462ab3d 100644
--- a/pytorch_widedeep/utils/text_utils.py
+++ b/pytorch_widedeep/utils/text_utils.py
@@ -54,7 +54,11 @@ def simple_preprocess(
     return tokens
 
 
-def get_texts(texts: List[str], n_cpus: Optional[int] = None) -> List[List[str]]:
+def get_texts(
+    texts: List[str],
+    already_processed: Optional[bool] = False,
+    n_cpus: Optional[int] = None,
+) -> List[List[str]]:
     r"""Tokenization using `Fastai`'s `Tokenizer` because it does a series of
     very convenient things during the tokenization process
 
@@ -64,6 +68,9 @@ def get_texts(texts: List[str], n_cpus: Optional[int] = None) -> List[List[str]]
     ----------
     texts: List
         List of str with the texts (or documents). One str per document
+    already_processed: bool, Optional, default = False
+        Boolean indicating if the text is already processed, in which case
+        we simply want to tokenize it
     n_cpus: int, Optional, default = None
         number of CPUs to use during the tokenization process
 
@@ -89,8 +96,11 @@
 
     num_cpus = n_cpus if n_cpus is not None else os.cpu_count()
 
-    processed_textx = [" ".join(simple_preprocess(t)) for t in texts]
-    tok = Tokenizer(n_cpus=num_cpus).process_all(processed_textx)
+    if not already_processed:
+        processed_texts = [" ".join(simple_preprocess(t)) for t in texts]
+    else:
+        processed_texts = texts
+    tok = Tokenizer(n_cpus=num_cpus).process_all(processed_texts)
     return tok
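+
+# A minimal sketch of the two paths (illustrative only, assuming the
+# function above):
+#
+#     raw = ["This is a Document!", "And another one"]
+#     get_texts(raw)  # simple_preprocess first, then Fastai tokenization
+#     get_texts(raw, already_processed=True)  # tokenize as-is, e.g. for
+#                                             # pre-built sequences of ids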
diff --git a/pytorch_widedeep/version.py b/pytorch_widedeep/version.py
index 67bc602abf06e9bcea675fe21c56a2f3c76bc331..9c73af26be70465839a5f43818dbab3f5c35571f 100644
--- a/pytorch_widedeep/version.py
+++ b/pytorch_widedeep/version.py
@@ -1 +1 @@
-__version__ = "1.3.0"
+__version__ = "1.3.1"
diff --git a/tests/test_datasets/test_datasets.py b/tests/test_datasets/test_datasets.py
index 516535ea24cd85a6ae767ead458d486e4ba48429..f22c61acc685b7d491069bfb1854b638b2bd9319 100644
--- a/tests/test_datasets/test_datasets.py
+++ b/tests/test_datasets/test_datasets.py
@@ -8,6 +8,7 @@ from pytorch_widedeep.datasets import (
     load_birds,
     load_ecoli,
     load_bio_kdd04,
+    load_movielens100k,
     load_womens_ecommerce,
     load_california_housing,
 )
@@ -116,3 +117,46 @@ def test_load_california_housing(as_frame):
         assert (df.shape, type(df)) == ((20640, 9), pd.DataFrame)
     else:
         assert (df.shape, type(df)) == ((20640, 9), np.ndarray)
+
+
+@pytest.mark.parametrize(
+    "as_frame",
+    [
+        (True),
+        (False),
+    ],
+)
+def test_load_movielens100k(as_frame):
+    df_data, df_users, df_items = load_movielens100k(as_frame=as_frame)
+    if as_frame:
+        assert (
+            df_data.shape,
+            df_users.shape,
+            df_items.shape,
+            type(df_data),
+            type(df_users),
+            type(df_items),
+        ) == (
+            (100000, 4),
+            (943, 5),
+            (1682, 24),
+            pd.DataFrame,
+            pd.DataFrame,
+            pd.DataFrame,
+        )
+    else:
+        assert (
+            df_data.shape,
+            df_users.shape,
+            df_items.shape,
+            type(df_data),
+            type(df_users),
+            type(df_items),
+        ) == (
+            (100000, 4),
+            (943, 5),
+            (1682, 24),
+            np.ndarray,
+            np.ndarray,
+            np.ndarray,
+        )
diff --git a/tests/test_model_components/test_mc_text.py b/tests/test_model_components/test_mc_text.py
index 39c37b25c9a0e77f45b071d5a97c86b3e0b3718c..f487633c913b39846a00c44c4a2b2ed8a345cdf9 100644
--- a/tests/test_model_components/test_mc_text.py
+++ b/tests/test_model_components/test_mc_text.py
@@ -2,7 +2,12 @@ import numpy as np
 import torch
 import pytest
 
-from pytorch_widedeep.models import BasicRNN, AttentiveRNN, StackedAttentiveRNN
+from pytorch_widedeep.models import (
+    BasicRNN,
+    Transformer,
+    AttentiveRNN,
+    StackedAttentiveRNN,
+)
 
 padded_sequences = np.random.choice(np.arange(1, 100), (100, 48))
 padded_sequences = np.hstack(
@@ -302,3 +307,80 @@ def test_attn_weights(stacked):
         )
     else:
         assert attn_w.size() == torch.Size([100, 50])
+
+
+###############################################################################
+# Test Basic Transformer
+###############################################################################
+
+
+@pytest.mark.parametrize(
+    "with_cls_token",
+    [True, False],
+)
+def test_basic_transformer(with_cls_token):
+    if with_cls_token:
+        # if we use a 'CLS' token it must be inserted at the beginning of the
+        # sequence
+        _padded_sequences = np.zeros(
+            (padded_sequences.shape[0], padded_sequences.shape[1] + 1), dtype=int
+        )
+        _padded_sequences[:, 0] = padded_sequences.max() + 1
+        _padded_sequences[:, 1:] = padded_sequences
+    else:
+        _padded_sequences = padded_sequences
+
+    model = Transformer(
+        vocab_size=_padded_sequences.max() + 1,
+        seq_length=_padded_sequences.shape[1],
+        input_dim=8,
+        n_heads=2,
+        n_blocks=2,
+        with_pos_encoding=False,
+        with_cls_token=with_cls_token,
+    )
+
+    out = model(torch.from_numpy(_padded_sequences))
+
+    res = []
+    res.append(out.size(0) == _padded_sequences.shape[0])
+    res.append(out.size(1) == model.output_dim)
+
+    assert all(res)
+
+
+###############################################################################
+# Test Custom Positional Encoder
+###############################################################################
+
+
+class DummyPositionalEncoding(torch.nn.Module):
+    def __init__(self, input_dim: int, seq_length: int):
+        super().__init__()
+
+        pe = torch.ones(1, seq_length, input_dim)
+        self.register_buffer("pe", pe)
+
+    def forward(self, X):
+        return X + self.pe
+
+
+def test_custom_pos_encoder():
+    model = Transformer(
+        vocab_size=padded_sequences.max() + 1,
+        seq_length=padded_sequences.shape[1],
+        input_dim=8,
+        n_heads=2,
+        n_blocks=2,
+        pos_encoder=DummyPositionalEncoding(
+            input_dim=8, seq_length=padded_sequences.shape[1]
+        ),
+    )
+
+    out = model(torch.from_numpy(padded_sequences))
+
+    res = []
+    res.append(out.size(0) == padded_sequences.shape[0])
+    res.append(out.size(1) == model.output_dim)
+
+    assert all(res)
diff --git a/tests/test_model_functioning/test_miscellaneous.py b/tests/test_model_functioning/test_miscellaneous.py
index 3b356d9d719e1692863c7079a3a8004fc24313ae..c7a93d43bdbac97a826578f0cd1f02c50c0019b6 100644
--- a/tests/test_model_functioning/test_miscellaneous.py
+++ b/tests/test_model_functioning/test_miscellaneous.py
@@ -17,6 +17,7 @@ from pytorch_widedeep.models import (
     BasicRNN,
     WideDeep,
     TabResnet,
+    Transformer,
     TabTransformer,
 )
 from pytorch_widedeep.metrics import Accuracy, Precision
@@ -89,7 +90,16 @@ tabnet = TabNet(
     continuous_cols=colnames[5:],
     ghost_bn=False,
 )
-deeptext = BasicRNN(vocab_size=vocab_size, embed_dim=32, padding_idx=0)
+basic_rnn = BasicRNN(vocab_size=vocab_size, embed_dim=32, padding_idx=0)
+
+basic_transformer = Transformer(
+    vocab_size=X_text.max() + 1,
+    maxlen=X_text.shape[1],
+    embed_dim=8,
+    n_heads=2,
+    n_blocks=2,
+)
+
 deepimage = Vision(pretrained_model_setup="resnet18", n_trainable=0)
 
 ###############################################################################
@@ -209,7 +219,8 @@ def test_basic_run_with_metrics_multiclass():
         (None, tabmlp, None, None, None, X_tab, None, None, target),
         (None, tabresnet, None, None, None, X_tab, None, None, target),
         (None, tabtransformer, None, None, None, X_tab, None, None, target),
-        (None, None, deeptext, None, None, None, X_text, None, target),
+        (None, None, basic_rnn, None, None, None, X_text, None, target),
+        (None, None, basic_transformer, None, None, None, X_text, None, target),
         (None, None, None, deepimage, None, None, None, X_img, target),
     ],
 )
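+
+# A minimal end-to-end sketch of the new text component (illustrative only,
+# reusing the objects defined above):
+#
+#     model = WideDeep(deeptext=basic_transformer)
+#     trainer = Trainer(model, objective="binary", metrics=[Accuracy()])
+#     trainer.fit(X_text=X_text, target=target, n_epochs=1, batch_size=16)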