Model Uncertainty Prediction¶
Note:
This notebook extends the "Custom DataLoader for Imbalanced dataset" notebook.
- In this notebook we will use the highly imbalanced Protein Homology Dataset from the KDD Cup 2004:
* The first element of each line is a BLOCK ID that denotes to which native sequence this example belongs. There is a unique BLOCK ID for each native sequence. BLOCK IDs are integers running from 1 to 303 (one for each native sequence, i.e. for each query). BLOCK IDs were assigned before the blocks were split into the train and test sets, so they do not run consecutively in either file.
* The second element of each line is an EXAMPLE ID that uniquely describes the example. You will need this EXAMPLE ID and the BLOCK ID when you submit results.
* The third element is the class of the example. Proteins that are homologous to the native sequence are denoted by 1, non-homologous proteins (i.e. decoys) by 0. Test examples have a "?" in this position.
* All following elements are feature values. There are 74 feature values in each line. The features describe the match (e.g. the score of a sequence alignment) between the native protein sequence and the sequence that is tested for homology.
Initial imports¶
In [1]:
import numpy as np
import pandas as pd
import torch
from torch.optim import SGD, lr_scheduler
from pytorch_widedeep import Trainer
from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep.dataloaders import DataLoaderImbalanced, DataLoaderDefault
from torchmetrics import F1Score as F1_torchmetrics
from torchmetrics import Accuracy as Accuracy_torchmetrics
from torchmetrics import Precision as Precision_torchmetrics
from torchmetrics import Recall as Recall_torchmetrics
from pytorch_widedeep.metrics import Accuracy, Recall, Precision, F1Score, R2Score
from pytorch_widedeep.initializers import XavierNormal
from pytorch_widedeep.datasets import load_bio_kdd04
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import time
import datetime
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
# increase displayed columns in jupyter notebook
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 300)
In [2]:
df = load_bio_kdd04(as_frame=True)
df.head()
Out[2]:
EXAMPLE_ID | BLOCK_ID | target | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 279 | 261532 | 0 | 52.0 | 32.69 | 0.30 | 2.5 | 20.0 | 1256.8 | -0.89 | 0.33 | 11.0 | -55.0 | 267.2 | 0.52 | 0.05 | -2.36 | 49.6 | 252.0 | 0.43 | 1.16 | -2.06 | -33.0 | -123.2 | 1.60 | -0.49 | -6.06 | 65.0 | 296.1 | -0.28 | -0.26 | -3.83 | -22.6 | -170.0 | 3.06 | -1.05 | -3.29 | 22.9 | 286.3 | 0.12 | 2.58 | 4.08 | -33.0 | -178.9 | 1.88 | 0.53 | -7.0 | -44.0 | 1987.0 | -5.41 | 0.95 | -4.0 | -57.0 | 722.9 | -3.26 | -0.55 | -7.5 | 125.5 | 1547.2 | -0.36 | 1.12 | 9.0 | -37.0 | 72.5 | 0.47 | 0.74 | -11.0 | -8.0 | 1595.1 | -1.64 | 2.83 | -2.0 | -50.0 | 445.2 | -0.35 | 0.26 | 0.76 |
1 | 279 | 261533 | 0 | 58.0 | 33.33 | 0.00 | 16.5 | 9.5 | 608.1 | 0.50 | 0.07 | 20.5 | -52.5 | 521.6 | -1.08 | 0.58 | -0.02 | -3.2 | 103.6 | -0.95 | 0.23 | -2.87 | -25.9 | -52.2 | -0.21 | 0.87 | -1.81 | 10.4 | 62.0 | -0.28 | -0.04 | 1.48 | -17.6 | -198.3 | 3.43 | 2.84 | 5.87 | -16.9 | 72.6 | -0.31 | 2.79 | 2.71 | -33.5 | -11.6 | -1.11 | 4.01 | 5.0 | -57.0 | 666.3 | 1.13 | 4.38 | 5.0 | -64.0 | 39.3 | 1.07 | -0.16 | 32.5 | 100.0 | 1893.7 | -2.80 | -0.22 | 2.5 | -28.5 | 45.0 | 0.58 | 0.41 | -19.0 | -6.0 | 762.9 | 0.29 | 0.82 | -3.0 | -35.0 | 140.3 | 1.16 | 0.39 | 0.73 |
2 | 279 | 261534 | 0 | 77.0 | 27.27 | -0.91 | 6.0 | 58.5 | 1623.6 | -1.40 | 0.02 | -6.5 | -48.0 | 621.0 | -1.20 | 0.14 | -0.20 | 73.6 | 609.1 | -0.44 | -0.58 | -0.04 | -23.0 | -27.4 | -0.72 | -1.04 | -1.09 | 91.1 | 635.6 | -0.88 | 0.24 | 0.59 | -18.7 | -7.2 | -0.60 | -2.82 | -0.71 | 52.4 | 504.1 | 0.89 | -0.67 | -9.30 | -20.8 | -25.7 | -0.77 | -0.85 | 0.0 | -20.0 | 2259.0 | -0.94 | 1.15 | -4.0 | -44.0 | -22.7 | 0.94 | -0.98 | -19.0 | 105.0 | 1267.9 | 1.03 | 1.27 | 11.0 | -39.5 | 82.3 | 0.47 | -0.19 | -10.0 | 7.0 | 1491.8 | 0.32 | -1.29 | 0.0 | -34.0 | 658.2 | -0.76 | 0.26 | 0.24 |
3 | 279 | 261535 | 0 | 41.0 | 27.91 | -0.35 | 3.0 | 46.0 | 1921.6 | -1.36 | -0.47 | -32.0 | -51.5 | 560.9 | -0.29 | -0.10 | -1.11 | 124.3 | 791.6 | 0.00 | 0.39 | -1.85 | -21.7 | -44.9 | -0.21 | 0.02 | 0.89 | 133.9 | 797.8 | -0.08 | 1.06 | -0.26 | -16.4 | -74.1 | 0.97 | -0.80 | -0.41 | 66.9 | 955.3 | -1.90 | 1.28 | -6.65 | -28.1 | 47.5 | -1.91 | 1.42 | 1.0 | -30.0 | 1846.7 | 0.76 | 1.10 | -4.0 | -52.0 | -53.9 | 1.71 | -0.22 | -12.0 | 97.5 | 1969.8 | -1.70 | 0.16 | -1.0 | -32.5 | 255.9 | -0.46 | 1.57 | 10.0 | 6.0 | 2047.7 | -0.98 | 1.53 | 0.0 | -49.0 | 554.2 | -0.83 | 0.39 | 0.73 |
4 | 279 | 261536 | 0 | 50.0 | 28.00 | -1.32 | -9.0 | 12.0 | 464.8 | 0.88 | 0.19 | 8.0 | -51.5 | 98.1 | 1.09 | -0.33 | -2.16 | -3.9 | 102.7 | 0.39 | -1.22 | -3.39 | -15.2 | -42.2 | -1.18 | -1.11 | -3.55 | 8.9 | 141.3 | -0.16 | -0.43 | -4.15 | -12.9 | -13.4 | -1.32 | -0.98 | -3.69 | 8.8 | 136.1 | -0.30 | 4.13 | 1.89 | -13.0 | -18.7 | -1.37 | -0.93 | 0.0 | -1.0 | 810.1 | -2.29 | 6.72 | 1.0 | -23.0 | -29.7 | 0.58 | -1.10 | -18.5 | 33.5 | 206.8 | 1.84 | -0.13 | 4.0 | -29.0 | 30.1 | 0.80 | -0.24 | 5.0 | -14.0 | 479.5 | 0.68 | -0.59 | 2.0 | -36.0 | -6.9 | 2.02 | 0.14 | -0.23 |
In [3]:
# imbalance of the classes
df["target"].value_counts()
Out[3]:
0    144455
1      1296
Name: target, dtype: int64
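The counts above correspond to roughly a 111:1 imbalance, i.e. the positive class is under 1% of the data. A quick sanity check, with the class counts hard-coded from the `value_counts()` output above:

```python
# Class counts taken from the value_counts() output above
n_negative = 144_455  # class 0: non-homologous proteins (decoys)
n_positive = 1_296    # class 1: homologous proteins

imbalance_ratio = n_negative / n_positive
minority_fraction = n_positive / (n_negative + n_positive)

print(f"imbalance ratio ~ {imbalance_ratio:.0f}:1")        # ~111:1
print(f"minority class fraction ~ {minority_fraction:.4%}")  # under 1%
```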
In [4]:
# drop columns we won't need in this example
df.drop(columns=["EXAMPLE_ID", "BLOCK_ID"], inplace=True)
In [5]:
df_train, df_valid = train_test_split(
df, test_size=0.2, stratify=df["target"], random_state=1
)
df_valid, df_test = train_test_split(
df_valid, test_size=0.5, stratify=df_valid["target"], random_state=1
)
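The two-step split above produces an 80/10/10 train/validation/test partition, and `stratify` preserves the class ratio in each piece. A quick sketch of the same pattern on a synthetic frame (toy data, not the actual dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with an exact 10% positive class
toy = pd.DataFrame({"target": [1 if i % 10 == 0 else 0 for i in range(1000)]})

# Same two-step recipe as above: 80% train, then split the rest 50/50
tr, va = train_test_split(toy, test_size=0.2, stratify=toy["target"], random_state=1)
va, te = train_test_split(va, test_size=0.5, stratify=va["target"], random_state=1)

print(len(tr), len(va), len(te))  # 800 100 100
# stratify keeps the positive-class fraction at ~0.1 in every split
print(tr["target"].mean(), va["target"].mean(), te["target"].mean())
```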
Preparing the data¶
In [6]:
continuous_cols = df.drop(columns=["target"]).columns.values.tolist()
In [7]:
# deeptabular
tab_preprocessor = TabPreprocessor(continuous_cols=continuous_cols, scale=True)
X_tab_train = tab_preprocessor.fit_transform(df_train)
X_tab_valid = tab_preprocessor.transform(df_valid)
X_tab_test = tab_preprocessor.transform(df_test)
# target
y_train = df_train["target"].values
y_valid = df_valid["target"].values
y_test = df_test["target"].values
Define the model¶
In [8]:
input_layer = len(tab_preprocessor.continuous_cols)
output_layer = 1
hidden_layers = np.linspace(
input_layer * 2, output_layer, 5, endpoint=False, dtype=int
).tolist()
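With 74 continuous features, this produces five hidden widths tapering from twice the input size down toward the output size (excluded via `endpoint=False`); `dtype=int` truncates each value. A standalone check, with the input size hard-coded to this dataset's 74 features:

```python
import numpy as np

input_layer = 74   # number of continuous features in this dataset
output_layer = 1

# linspace from 148 toward 1, 5 steps, endpoint excluded, values truncated
hidden_layers = np.linspace(
    input_layer * 2, output_layer, 5, endpoint=False, dtype=int
).tolist()

print(hidden_layers)  # [148, 118, 89, 59, 30]
```

These are exactly the `in_features`/`out_features` pairs visible in the model printout below.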
In [9]:
deeptabular = TabMlp(
mlp_hidden_dims=hidden_layers,
column_idx=tab_preprocessor.column_idx,
continuous_cols=tab_preprocessor.continuous_cols,
)
model = WideDeep(deeptabular=deeptabular, pred_dim=1)
model
Out[9]:
WideDeep(
  (deeptabular): Sequential(
    (0): TabMlp(
      (cat_and_cont_embed): DiffSizeCatAndContEmbeddings(
        (cont_norm): BatchNorm1d(74, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (encoder): MLP(
        (mlp): Sequential(
          (dense_layer_0): Sequential(
            (0): Dropout(p=0.1, inplace=False)
            (1): Linear(in_features=74, out_features=148, bias=True)
            (2): ReLU(inplace=True)
          )
          (dense_layer_1): Sequential(
            (0): Dropout(p=0.1, inplace=False)
            (1): Linear(in_features=148, out_features=118, bias=True)
            (2): ReLU(inplace=True)
          )
          (dense_layer_2): Sequential(
            (0): Dropout(p=0.1, inplace=False)
            (1): Linear(in_features=118, out_features=89, bias=True)
            (2): ReLU(inplace=True)
          )
          (dense_layer_3): Sequential(
            (0): Dropout(p=0.1, inplace=False)
            (1): Linear(in_features=89, out_features=59, bias=True)
            (2): ReLU(inplace=True)
          )
          (dense_layer_4): Sequential(
            (0): Dropout(p=0.1, inplace=False)
            (1): Linear(in_features=59, out_features=30, bias=True)
            (2): ReLU(inplace=True)
          )
        )
      )
    )
    (1): Linear(in_features=30, out_features=1, bias=True)
  )
)
In [10]:
# # Metrics from torchmetrics
# accuracy = Accuracy_torchmetrics(average=None, num_classes=1)
# precision = Precision_torchmetrics(average="micro", num_classes=1)
# f1 = F1_torchmetrics(average=None, num_classes=1)
# recall = Recall_torchmetrics(average=None, num_classes=1)
In [11]:
# Metrics from pytorch-widedeep
accuracy = Accuracy(top_k=2)
precision = Precision(average=False)
recall = Recall(average=True)
f1 = F1Score(average=False)
In [12]:
# Optimizers
deep_opt = SGD(model.deeptabular.parameters(), lr=0.1)
# LR Scheduler
deep_sch = lr_scheduler.StepLR(deep_opt, step_size=3)
trainer = Trainer(
model,
objective="binary",
lr_schedulers={"deeptabular": deep_sch},
initializers={"deeptabular": XavierNormal},
optimizers={"deeptabular": deep_opt},
metrics=[accuracy, precision, recall, f1],
verbose=1,
)
In [13]:
start = time.time()
trainer.fit(
X_train={"X_tab": X_tab_train, "target": y_train},
X_val={"X_tab": X_tab_valid, "target": y_valid},
n_epochs=3,
batch_size=50,
custom_dataloader=DataLoaderImbalanced,
oversample_mul=5,
)
print(
"Training time[s]: {}".format(
datetime.timedelta(seconds=round(time.time() - start))
)
)
epoch 1: 100%|██████████| 42/42 [00:00<00:00, 154.80it/s, loss=0.419, metrics={'acc': 0.8134, 'prec': [0.8417], 'rec': 0.7671, 'f1': [0.8027]}]
valid: 100%|██████████| 292/292 [00:01<00:00, 220.78it/s, loss=0.608, metrics={'acc': 0.6963, 'prec': [0.0275], 'rec': 0.969, 'f1': [0.0535]}]
epoch 2: 100%|██████████| 42/42 [00:00<00:00, 170.29it/s, loss=0.24, metrics={'acc': 0.905, 'prec': [0.9151], 'rec': 0.895, 'f1': [0.905]}]
valid: 100%|██████████| 292/292 [00:01<00:00, 218.21it/s, loss=0.0608, metrics={'acc': 0.9883, 'prec': [0.4156], 'rec': 0.7829, 'f1': [0.543]}]
epoch 3: 100%|██████████| 42/42 [00:00<00:00, 169.71it/s, loss=0.193, metrics={'acc': 0.9219, 'prec': [0.9292], 'rec': 0.909, 'f1': [0.919]}]
valid: 100%|██████████| 292/292 [00:01<00:00, 217.74it/s, loss=0.0865, metrics={'acc': 0.9726, 'prec': [0.2229], 'rec': 0.845, 'f1': [0.3528]}]
Training time[s]: 0:00:05
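`DataLoaderImbalanced` oversamples the minority class so that each training batch is roughly balanced. A numpy-only sketch of the inverse-frequency sampling idea (this is an assumption about the mechanism, not the library's actual code; an epoch of roughly `n_classes * minority_count` samples is at least consistent with the 42 batches of 50 above, since the training split holds about 1,037 positives):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels: 990 majority / 10 minority, same flavour as the dataset
y = np.array([0] * 990 + [1] * 10)

# Inverse-frequency weight per sample: rare classes get drawn more often
class_counts = np.bincount(y)          # [990, 10]
weights = 1.0 / class_counts[y]
probs = weights / weights.sum()        # each class ends up with total prob 0.5

# One "epoch": roughly n_classes * minority_count samples
n_samples = len(class_counts) * class_counts.min()   # 20

idx = rng.choice(len(y), size=n_samples, replace=True, p=probs)
sampled_counts = np.bincount(y[idx], minlength=2)
print(sampled_counts)  # roughly balanced, e.g. close to [10, 10]
```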
In [14]:
pd.DataFrame(trainer.history)
Out[14]:
train_loss | train_acc | train_prec | train_rec | train_f1 | val_loss | val_acc | val_prec | val_rec | val_f1 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.419141 | 0.813404 | [0.8417112231254578] | 0.767057 | [0.8026517033576965] | 0.607627 | 0.696329 | [0.027490653097629547] | 0.968992 | [0.05346449464559555] |
1 | 0.240124 | 0.905014 | [0.9151219725608826] | 0.895038 | [0.9049686193466187] | 0.060850 | 0.988336 | [0.4156378507614136] | 0.782946 | [0.5430107712745667] |
2 | 0.193189 | 0.921890 | [0.9292214512825012] | 0.909001 | [0.918999969959259] | 0.086451 | 0.972556 | [0.22290389239788055] | 0.844961 | [0.3527508080005646] |
"Normal" prediction¶
In [15]:
df_pred = trainer.predict(X_tab=X_tab_test)
print(classification_report(df_test["target"].to_list(), df_pred))
print("Actual predicted values:\n{}".format(np.unique(df_pred, return_counts=True)))
predict: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 292/292 [00:00<00:00, 434.55it/s]
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     14446
           1       0.26      0.91      0.40       130

    accuracy                           0.98     14576
   macro avg       0.63      0.94      0.69     14576
weighted avg       0.99      0.98      0.98     14576

Actual predicted values:
(array([0, 1]), array([14117,   459]))
Prediction using uncertainty¶
In [16]:
df_pred_unc = trainer.predict_uncertainty(X_tab=X_tab_test, uncertainty_granularity=10)
print(classification_report(df_test["target"].to_list(), df_pred))
print(
"Actual predicted values:\n{}".format(
np.unique(df_pred_unc[:, -1], return_counts=True)
)
)
predict_UncertaintyIter: 100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:02<00:00, 3.88it/s]
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     14446
           1       0.26      0.91      0.40       130

    accuracy                           0.98     14576
   macro avg       0.63      0.94      0.69     14576
weighted avg       0.99      0.98      0.98     14576

Actual predicted values:
(array([0.]), array([14576]))
In [17]:
df_pred_unc
Out[17]:
array([[0.94406313, 0.0559369 , 0.        ],
       [0.99556708, 0.00443294, 0.        ],
       [0.97940278, 0.02059719, 0.        ],
       ...,
       [0.97945035, 0.02054968, 0.        ],
       [0.99597085, 0.00402915, 0.        ],
       [0.98828971, 0.01171026, 0.        ]])
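`predict_uncertainty` follows the Monte Carlo dropout idea: dropout stays active at inference, `uncertainty_granularity` stochastic forward passes are run, and the class probabilities are averaged; each row of the returned array holds the mean probability of class 0, of class 1, and the final predicted class. A numpy-only sketch of the aggregation step (the random numbers below merely stand in for real per-pass model outputs):

```python
import numpy as np

rng = np.random.default_rng(42)

uncertainty_granularity = 10   # number of stochastic forward passes
n_examples = 5

# Stand-in for the positive-class probability from each dropout-active pass
passes = rng.uniform(0.0, 0.2, size=(uncertainty_granularity, n_examples))

p_pos = passes.mean(axis=0)                # average over passes
p_neg = 1.0 - p_pos
pred_class = (p_pos >= 0.5).astype(float)  # all 0.0 here, as in Out[17]

# Same column layout as df_pred_unc: [p(class 0), p(class 1), predicted class]
result = np.column_stack([p_neg, p_pos, pred_class])
print(result.shape)  # (5, 3)
```

The spread of the per-pass probabilities (e.g. `passes.std(axis=0)`) is what gives a handle on model uncertainty for each example.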