Commit 07c44337 authored by: Pavol Mulinka

added all notebooks with correct names

Parent c19949dd
{
"cells": [
{
"cell_type": "markdown",
"id": "fe44d03a",
"metadata": {},
"source": [
"# Custom DataLoader for Imbalanced dataset"
]
},
{
"cell_type": "markdown",
"id": "5731376a",
"metadata": {},
"source": [
"* In this notebook we will use the higly imbalanced Protein Homology Dataset from [KDD cup 2004](https://www.kdd.org/kdd-cup/view/kdd-cup-2004/Data)\n",
"\n",
"```\n",
"* The first element of each line is a BLOCK ID that denotes to which native sequence this example belongs. There is a unique BLOCK ID for each native sequence. BLOCK IDs are integers running from 1 to 303 (one for each native sequence, i.e. for each query). BLOCK IDs were assigned before the blocks were split into the train and test sets, so they do not run consecutively in either file.\n",
"* The second element of each line is an EXAMPLE ID that uniquely describes the example. You will need this EXAMPLE ID and the BLOCK ID when you submit results.\n",
"* The third element is the class of the example. Proteins that are homologous to the native sequence are denoted by 1, non-homologous proteins (i.e. decoys) by 0. Test examples have a \"?\" in this position.\n",
"* All following elements are feature values. There are 74 feature values in each line. The features describe the match (e.g. the score of a sequence alignment) between the native protein sequence and the sequence that is tested for homology.\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "8969d2a7",
"metadata": {},
"source": [
"## Initial imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5d189eca",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/javierrodriguezzaurin/.pyenv/versions/3.8.12/envs/wd38/lib/python3.8/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import torch\n",
"from torch.optim import SGD, lr_scheduler\n",
"\n",
"from pytorch_widedeep import Trainer\n",
"from pytorch_widedeep.preprocessing import TabPreprocessor\n",
"from pytorch_widedeep.models import TabMlp, WideDeep\n",
"from pytorch_widedeep.dataloaders import DataLoaderImbalanced, DataLoaderDefault\n",
"from torchmetrics import F1Score as F1_torchmetrics\n",
"from torchmetrics import Accuracy as Accuracy_torchmetrics\n",
"from torchmetrics import Precision as Precision_torchmetrics\n",
"from torchmetrics import Recall as Recall_torchmetrics\n",
"from pytorch_widedeep.metrics import Accuracy, Recall, Precision, F1Score, R2Score\n",
"from pytorch_widedeep.initializers import XavierNormal\n",
"from pytorch_widedeep.datasets import load_bio_kdd04\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import classification_report\n",
"\n",
"import time\n",
"import datetime\n",
"\n",
"import warnings\n",
"\n",
"warnings.filterwarnings(\"ignore\", category=DeprecationWarning)\n",
"\n",
"# increase displayed columns in jupyter notebook\n",
"pd.set_option(\"display.max_columns\", 200)\n",
"pd.set_option(\"display.max_rows\", 300)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e72ebad3",
"metadata": {},
"outputs": [],
"source": [
"df = load_bio_kdd04(as_frame=True)\n",
"# drop columns we won't need in this example\n",
"df.drop(columns=[\"EXAMPLE_ID\", \"BLOCK_ID\"], inplace=True)\n",
"\n",
"df_train, df_valid = train_test_split(\n",
" df, test_size=0.2, stratify=df[\"target\"], random_state=1\n",
")\n",
"df_valid, df_test = train_test_split(\n",
" df_valid, test_size=0.5, stratify=df_valid[\"target\"], random_state=1\n",
")\n",
"\n",
"continuous_cols = df.drop(columns=[\"target\"]).columns.values.tolist()"
]
},
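{
"cell_type": "markdown",
"id": "7f2a9c10",
"metadata": {},
"source": [
"Before preprocessing, it is worth quantifying how imbalanced the target actually is. The short check below (added for clarity, not part of the original run) prints the class proportions per split using standard pandas calls on the frames created above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b6d8e21",
"metadata": {},
"outputs": [],
"source": [
"# quick look at the class imbalance that motivates DataLoaderImbalanced\n",
"for name, frame in [(\"train\", df_train), (\"valid\", df_valid), (\"test\", df_test)]:\n",
"    print(name, frame[\"target\"].value_counts(normalize=True).round(4).to_dict())"
]
},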
{
"cell_type": "code",
"execution_count": 3,
"id": "81cf3284",
"metadata": {},
"outputs": [],
"source": [
"# deeptabular\n",
"tab_preprocessor = TabPreprocessor(continuous_cols=continuous_cols, scale=True)\n",
"X_tab_train = tab_preprocessor.fit_transform(df_train)\n",
"X_tab_valid = tab_preprocessor.transform(df_valid)\n",
"X_tab_test = tab_preprocessor.transform(df_test)\n",
"\n",
"# target\n",
"y_train = df_train[\"target\"].values\n",
"y_valid = df_valid[\"target\"].values\n",
"y_test = df_test[\"target\"].values"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d202b17c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"WideDeep(\n",
" (deeptabular): Sequential(\n",
" (0): TabMlp(\n",
" (cat_and_cont_embed): DiffSizeCatAndContEmbeddings(\n",
" (cont_norm): BatchNorm1d(74, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
" )\n",
" (tab_mlp): MLP(\n",
" (mlp): Sequential(\n",
" (dense_layer_0): Sequential(\n",
" (0): Dropout(p=0.1, inplace=False)\n",
" (1): Linear(in_features=74, out_features=148, bias=True)\n",
" (2): ReLU(inplace=True)\n",
" )\n",
" (dense_layer_1): Sequential(\n",
" (0): Dropout(p=0.1, inplace=False)\n",
" (1): Linear(in_features=148, out_features=118, bias=True)\n",
" (2): ReLU(inplace=True)\n",
" )\n",
" (dense_layer_2): Sequential(\n",
" (0): Dropout(p=0.1, inplace=False)\n",
" (1): Linear(in_features=118, out_features=89, bias=True)\n",
" (2): ReLU(inplace=True)\n",
" )\n",
" (dense_layer_3): Sequential(\n",
" (0): Dropout(p=0.1, inplace=False)\n",
" (1): Linear(in_features=89, out_features=59, bias=True)\n",
" (2): ReLU(inplace=True)\n",
" )\n",
" (dense_layer_4): Sequential(\n",
" (0): Dropout(p=0.1, inplace=False)\n",
" (1): Linear(in_features=59, out_features=30, bias=True)\n",
" (2): ReLU(inplace=True)\n",
" )\n",
" )\n",
" )\n",
" )\n",
" (1): Linear(in_features=30, out_features=1, bias=True)\n",
" )\n",
")"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Define the model\n",
"input_layer = len(tab_preprocessor.continuous_cols)\n",
"output_layer = 1\n",
"hidden_layers = np.linspace(\n",
" input_layer * 2, output_layer, 5, endpoint=False, dtype=int\n",
").tolist()\n",
"\n",
"deeptabular = TabMlp(\n",
" column_idx=tab_preprocessor.column_idx,\n",
" continuous_cols=tab_preprocessor.continuous_cols,\n",
" mlp_hidden_dims=hidden_layers,\n",
")\n",
"model = WideDeep(deeptabular=deeptabular)\n",
"model"
]
},
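{
"cell_type": "markdown",
"id": "5e0c4f77",
"metadata": {},
"source": [
"As a worked check of the `np.linspace` taper above (added here for clarity, not part of the original run): with 74 continuous features the schedule starts at `2 * 74 = 148` and moves towards 1 in 5 equal steps with the endpoint excluded; casting to `int` truncates each value, which reproduces the layer widths in the printed model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c9d2a44",
"metadata": {},
"outputs": [],
"source": [
"# 148.0, 118.6, 89.2, 59.8, 30.4 truncated to int -> [148, 118, 89, 59, 30]\n",
"np.linspace(74 * 2, 1, 5, endpoint=False, dtype=int).tolist()"
]
},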
{
"cell_type": "code",
"execution_count": 5,
"id": "c6262897",
"metadata": {},
"outputs": [],
"source": [
"# Metrics from pytorch-widedeep\n",
"accuracy = Accuracy(top_k=2)\n",
"precision = Precision(average=False)\n",
"\n",
"# # Metrics from torchmetrics\n",
"# accuracy = Accuracy_torchmetrics(average=None, num_classes=1)\n",
"# precision = Precision_torchmetrics(average=\"micro\", num_classes=1)\n",
"\n",
"# Optimizers\n",
"deep_opt = SGD(model.deeptabular.parameters(), lr=0.1)\n",
"\n",
"# LR Scheduler\n",
"deep_sch = lr_scheduler.StepLR(deep_opt, step_size=3)\n",
"\n",
"trainer = Trainer(\n",
" model,\n",
" objective=\"binary\",\n",
" lr_schedulers={\"deeptabular\": deep_sch},\n",
" initializers={\"deeptabular\": XavierNormal},\n",
" optimizers={\"deeptabular\": deep_opt},\n",
" metrics=[accuracy, precision],\n",
" verbose=1,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3c34b76f",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"epoch 1: 100%|██████████████████████████████████| 65/65 [00:00<00:00, 178.26it/s, loss=0.358, metrics={'acc': 0.8419, 'prec': [0.8503]}]\n",
"valid: 100%|█████████████████████████████████| 456/456 [00:01<00:00, 282.95it/s, loss=0.0998, metrics={'acc': 0.9853, 'prec': [0.3569]}]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training time[s]: 0:00:02\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"predict: 100%|███████████████████████████████████████████████████████████████████████████████████████| 456/456 [00:00<00:00, 566.27it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 1.00 0.99 0.99 14446\n",
" 1 0.39 0.85 0.54 130\n",
"\n",
" accuracy 0.99 14576\n",
" macro avg 0.70 0.92 0.76 14576\n",
"weighted avg 0.99 0.99 0.99 14576\n",
"\n",
"Actual predicted values:\n",
"(array([0, 1]), array([14295, 281]))\n"
]
}
],
"source": [
"start = time.time()\n",
"trainer.fit(\n",
" X_train={\"X_tab\": X_tab_train, \"target\": y_train},\n",
" X_val={\"X_tab\": X_tab_valid, \"target\": y_valid},\n",
" n_epochs=1,\n",
" batch_size=32,\n",
" custom_dataloader=DataLoaderImbalanced,\n",
" oversample_mul=5,\n",
")\n",
"print(\n",
" \"Training time[s]: {}\".format(\n",
" datetime.timedelta(seconds=round(time.time() - start))\n",
" )\n",
")\n",
"\n",
"pd.DataFrame(trainer.history)\n",
"\n",
"df_pred = trainer.predict(X_tab=X_tab_test)\n",
"print(classification_report(df_test[\"target\"].to_list(), df_pred))\n",
"print(\"Actual predicted values:\\n{}\".format(np.unique(df_pred, return_counts=True)))"
]
}
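,
{
"cell_type": "markdown",
"id": "b4e7a1f9",
"metadata": {},
"source": [
"For reference, the same run can be repeated with the non-oversampling `DataLoaderDefault` that is already imported above. The cell below is a minimal sketch of such a follow-up (not executed here), assuming a fresh model and trainer so the comparison starts from untrained weights; given the heavy imbalance, one would expect high overall accuracy but much weaker recall on the positive class."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d3f6b25",
"metadata": {},
"outputs": [],
"source": [
"# sketch: baseline with the default (non-oversampling) dataloader,\n",
"# re-instantiating model and trainer so training starts from scratch\n",
"model_default = WideDeep(\n",
"    deeptabular=TabMlp(\n",
"        column_idx=tab_preprocessor.column_idx,\n",
"        continuous_cols=tab_preprocessor.continuous_cols,\n",
"        mlp_hidden_dims=hidden_layers,\n",
"    )\n",
")\n",
"trainer_default = Trainer(\n",
"    model_default,\n",
"    objective=\"binary\",\n",
"    metrics=[Accuracy(top_k=2), Precision(average=False)],\n",
")\n",
"trainer_default.fit(\n",
"    X_train={\"X_tab\": X_tab_train, \"target\": y_train},\n",
"    X_val={\"X_tab\": X_tab_valid, \"target\": y_valid},\n",
"    n_epochs=1,\n",
"    batch_size=32,\n",
"    custom_dataloader=DataLoaderDefault,\n",
")\n",
"preds_default = trainer_default.predict(X_tab=X_tab_test)\n",
"print(classification_report(df_test[\"target\"].to_list(), preds_default))"
]
}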
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -42,7 +42,7 @@ nav:
 - 03_binary_classification_with_defaults: examples/03_binary_classification_with_defaults.ipynb
 - 04_regression_with_images_and_text: examples/04_regression_with_images_and_text.ipynb
 - 05_save_and_load_model_and_artifacts: examples/05_save_and_load_model_and_artifacts.ipynb
-- 06_fineTune_and_warmup: examples/06_fineTune_and_warmup.ipynb
+- 06_finetune_and_warmup: examples/06_finetune_and_warmup.ipynb
 - 07_custom_components: examples/07_custom_components.ipynb
 - 08_custom_dataLoader_imbalanced_dataset: examples/08_custom_dataLoader_imbalanced_dataset.ipynb
 - 09_extracting_embeddings: examples/09_extracting_embeddings.ipynb