From e74878d42e7eb4353185950f94eea377e0312379 Mon Sep 17 00:00:00 2001 From: PyCaret Date: Tue, 25 Feb 2020 18:51:16 -0500 Subject: [PATCH] Delete Binary Classification Tutorial (CLF102) - Level Intermediate.ipynb --- ...torial (CLF102) - Level Intermediate.ipynb | 16237 ---------------- 1 file changed, 16237 deletions(-) delete mode 100644 Tutorials/Binary Classification Tutorial (CLF102) - Level Intermediate.ipynb diff --git a/Tutorials/Binary Classification Tutorial (CLF102) - Level Intermediate.ipynb b/Tutorials/Binary Classification Tutorial (CLF102) - Level Intermediate.ipynb deleted file mode 100644 index d158538..0000000 --- a/Tutorials/Binary Classification Tutorial (CLF102) - Level Intermediate.ipynb +++ /dev/null @@ -1,16237 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Binary Classification Tutorial (CLF102) - Level Intermediate" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Date Updated: Feb 25, 2020**\n", - "\n", - "# 1.0 Objective of Tutorial\n", - "Welcome to Binary Classification Tutorial (#CLF102). This tutorial assumes that you have completed __[Binary Classification Tutorial - Level Beginner (#CLF101)](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__. 
If you haven't used PyCaret before and this is your first tutorial, we strongly recommend you to go back and progress through the beginner's tutorial to understand basics of working in PyCaret.\n", - "\n", - "In this tutorial using `pycaret.classification` module we will learn:\n", - "\n", - "* **Normalization:** How to normalize and scale the dataset?\n", - "* **Transformation:** How to apply transformations that makes data linear and approximately normal?\n", - "* **Ignore Low Variance:** How to remove features with statistically insignificant variances to make experiment more efficient?\n", - "* **Remove Multi-collinearity:** How to remove multi-collinearity from the dataset to boost performance of Linear algorithms?\n", - "* **Group Features:** How to extract statistical information from related features in the dataset?\n", - "* **Bin Numeric Variables:** How to bin numeric variables and transform numeric features into categorical using 'sturges' rule? \n", - "* **Model Ensembling and Stacking:** How to boost model performance using several ensembling techniques such as Bagging, Boosting, Soft/hard Voting and Generalized Stacking.\n", - "* **Tuning Hyperparameters of Ensemblers:** How to tune hyperparameters of ensemblers? \n", - "* **Model Calibration:** How to calibrate probabilities of a classification model?\n", - "* **Save / Load Experiment:** How to save/load an entire experiment?\n", - "\n", - "Read Time : Approx 60 Minutes\n", - "\n", - "\n", - "## 1.1 Installing PyCaret\n", - "If you haven't installed PyCaret yet. 
Please follow the link to __[Beginner's Tutorial](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__ for instructions on how to install pycaret.\n", - "\n", - "## 1.2 Pre-Requisites\n", - "- Python 3.x\n", - "- Latest version of pycaret\n", - "- Internet connection to load data from pycaret's repository\n", - "- Completion of Binary Classification Tutorial (CLF101) - Level Beginner\n", - "\n", - "## 1.3 For Google colab users:\n", - "If you are running this notebook on Google colab, the lines of code below must be run at the top of the notebook to display interactive visuals.
\n", - "
\n", - "`from pycaret.utils import enable_colab`
\n", - "`enable_colab()`\n", - "\n", - "\n", - "## 1.4 See also:\n", - "- __[Binary Classification Tutorial (CLF101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__\n", - "- __[Binary Classification Tutorial (CLF103) - Level Expert](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF103)%20-%20Level%20Expert.ipynb)__" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 2.0 Brief overview of techniques covered in this tutorial\n", - "Before we practically get into execution of the techniques mentioned above in the Section 1 (Objective of Tutorial), it is important to understand what are these techniques and when to use them. More often than not most of these techniques will help linear and parametric algorithms, however it is no surprise to see performance gain in tree-based models. Below explanations are only brief and we recommend you to do extra reading to dive deeper to get more understanding of these techniques.\n", - "\n", - "- **Normalization:** Normalization / Scaling (often used interchangeably with standardization) is used to transform the actual values of numeric variables in a way that the transformed values have helpful properties for machine learning. Many machine learning algorithms such as Logistic Regression, Support Vector Machine, K Nearest Neighbors and Naive Bayes assume that all features are centered around zero and have variance in the same order. If a particular feature in dataset has a variance that is larger from other features in order of magnitude, the model may not understand all features correctly and may perform poorly. For example, in the dataset we are using for this example, it contains `AGE` feature which ranges between 21 to 79 while other numeric features are in range of 10,000 to 1,000,000. 
__[Read more](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html#z-score-standardization-or-min-max-scaling)__
\n", - "
\n", - "- **Transformation:** While normalization rescales the data to remove the effect of differing magnitudes of variance, transformation is a more radical technique: it changes the shape of the distribution so that the transformed data can be represented by a normal or approximately normal distribution. In general, you should transform the data if you are going to use algorithms that assume normality or a Gaussian distribution. Some examples of such models are Logistic Regression, Linear Discriminant Analysis (LDA) and Gaussian Naive Bayes. (Pro tip: any method with “Gaussian” in the name probably assumes normality.) __[Read more](https://en.wikipedia.org/wiki/Power_transform)__
\n", - "
\n", - "- **Ignore Low Variance:** Datasets sometimes contain categorical features that take a single value across all samples. Such features are not only uninformative but can also be harmful for some algorithms. A feature with only one unique value, or with a few dominant values across samples, can be removed from the dataset using the ignore low variance feature in PyCaret.
\n", - "
\n", - "- **Multi-collinearity:** Multi-collinearity is a state of very high intercorrelations or inter-associations among the independent features in the dataset. It is therefore a type of disturbance in the data that is not handled well by machine learning models (mostly linear algorithms). Multi-collinearity can make the estimated coefficients of a model unstable and inflate their variance. This may lead to overfitting, where the model does well on the known training set but fails on the unseen test set. __[Read more](https://towardsdatascience.com/multicollinearity-in-data-science-c5f6c0fe6edf)__
\n", - "
\n", - "- **Group Features:** Sometimes a dataset contains features that are related at a sample level. For example, the `credit` dataset we are using contains the features `BILL_AMT1 .. BILL_AMT6`, which are related in that `BILL_AMT1` is the bill amount 1 month ago and `BILL_AMT6` is the bill amount 6 months ago. Such features can be used to extract additional features based on statistical properties of the distribution such as mean, median, variance, standard deviation etc.
\n", - "
\n", - "- **Bin Numeric Variables:** Binning or discretization is the process of transforming numerical variables into categorical features. An example would be the Age variable: age is a continuous numeric variable that can be discretized into intervals (10-20 years, 21-30 etc.). Binning may improve the accuracy of predictive models by reducing noise or non-linearity in the data. PyCaret automatically determines the number and size of bins using Sturges' rule. __[Read more](https://www.vosesoftware.com/riskwiki/Sturgesrule.php)__
\n", - "
\n", - "- **Model Ensembling and Stacking:** Ensemble modeling is a process where multiple diverse models are created to predict an outcome, either by using many different modeling algorithms or by using different samples of the training data. The ensemble model then aggregates the predictions of the base models and produces one final prediction for the unseen data. The motivation for using ensemble models is to reduce the generalization error of the prediction. As long as the base models are diverse and independent, the prediction error of the model decreases when the ensemble approach is used. The two most common methods in ensemble learning are `Bagging` and `Boosting`. Stacking is also a type of ensemble learning, where predictions from multiple base learners are used as input features for a meta model that predicts the final outcome. __[Read more](https://blog.statsbot.co/ensemble-learning-d1dcd548e936)__
\n", - "
\n", - "- **Tuning Hyperparameters of Ensemblers:** Similar to tuning the hyperparameters of a single machine learning model, we will also learn how to tune the hyperparameters of an ensemble model." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 3.0 Dataset for the Tutorial" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For this tutorial we will be using the same dataset that was used in __[CLF101 - Beginner Tutorial.](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__\n", - "\n", - "#### Dataset Acknowledgements:\n", - "Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.\n", - "\n", - "The original dataset and data dictionary can be __[found here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)__ at the UCI Machine Learning Repository." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 4.0 Getting the Data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can download the data from the original source __[found here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)__ and load it using pandas __[(Learn How)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)__ or you can use PyCaret's data repository to load the data using the `get_data` function (this requires an internet connection)." 
- ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " " - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/plain": [] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from pycaret.datasets import get_data\n", - "dataset = get_data('credit', profile=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that when the `profile` parameter is set to `True`, a data profile is displayed for Exploratory Data Analysis. Several of the pre-processing steps discussed in section 2 above are performed in this experiment based on this analysis. Let's summarize how this analysis has helped us make critical choices in pre-processing the data.\n", - "\n", - "- **Missing Values:** There are no missing values in the data. However, we still need imputers in our pipeline just in case the new unseen data has missing values (not applicable in this case). When you execute the `setup()` function, imputers are created and stored in the pipeline automatically. By default, it uses a mean imputer for numeric values and a constant imputer for categorical values, but this can be changed using the `numeric_imputation` and `categorical_imputation` parameters in `setup()`.
\n", - "
\n", - "- **Multicollinearity:** Several of the features `BILL_AMT1 ... BILL_AMT6` are highly correlated with each other, which introduces multicollinearity in the data. We will remove multi-collinearity by using the `remove_multicollinearity` and `multicollinearity_threshold` parameters in setup.
\n", - "
\n", - "- **Data Scale / Range:** Notice how the scale / range of the numeric features differs. For example, `AGE` ranges from 21 to 79 while `BILL_AMT1` ranges from -165,580 to 964,511. This may cause problems for algorithms that assume all features have variance of the same order. In this case, the order of magnitude of `BILL_AMT1` is very different from that of `AGE`. We will deal with this problem by using the `normalize` parameter in setup.
\n", - "
\n", - "- **Distribution of Feature Space:** Numeric features are not normally distributed. Look at the distribution of `LIMIT_BAL`, `BILL_AMT1` and `PAY_AMT1 ... PAY_AMT6`. A few features are also highly skewed, e.g. `PAY_AMT1`. This may cause problems for algorithms that assume a normal or approximately normal distribution of the data, for example Logistic Regression, Linear Discriminant Analysis (LDA) and Naive Bayes. We will deal with this problem by using the `transformation` parameter in setup.
\n", - "
\n", - "- **Group Features:** From the data description we know certain features are related to each other. For example, `BILL_AMT1 ... BILL_AMT6` are related. Similarly, `PAY_AMT1 ... PAY_AMT6` are related. We will use the `group_features` parameter in setup to extract statistical information from these features.
\n", - "
\n", - "- **Bin Numeric Features:** Looking at the correlation between numeric features with target variable, we see a very weak relation of `AGE` and `LIMIT_BAL` with target variable. We will use `bin_numeric_features` parameter to remove the noise from these variables. This may help linear algorithms.
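
To make the **Data Scale / Range** and **Bin Numeric Features** points above concrete, here is a minimal sketch in plain Python of z-score scaling and of the bin count suggested by Sturges' rule. It is an illustration only and is not PyCaret's internal code: the helper names `zscore` and `sturges_bins` and the sample values are assumptions made for this example, and the sketch assumes the population standard deviation is used for scaling.

```python
import math
import statistics


def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def sturges_bins(n_samples):
    """Number of bins suggested by Sturges' rule: ceil(log2(n)) + 1."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return math.ceil(math.log2(n_samples)) + 1


# Hypothetical values spanning the ranges quoted above, not real rows from the data.
age = [21, 35, 50, 64, 79]
bill_amt1 = [-165580, 0, 50000, 400000, 964511]

age_z = zscore(age)
bill_z = zscore(bill_amt1)

# After scaling, both features live on a comparable scale.
print([round(z, 2) for z in age_z])
print([round(z, 2) for z in bill_z])

# Bin count that Sturges' rule would suggest for a sample of this size.
print(sturges_bins(len(age)))
```

After scaling, both lists have mean 0 and standard deviation 1, which is the property that distance-based and linear algorithms benefit from. The bin count grows only logarithmically with the number of samples, so even a large dataset is split into a modest number of intervals.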
" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(24000, 24)" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "#check the shape of data\n", - "dataset.shape" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In order to demonstrate the `predict_model()` function on unseen data, a sample of 1200 rows are taken out from original dataset to be used for predictions. This should not be confused with train/test split. This split is performed to simulate real life scenario. Another way to think about this is that these 1200 records were not available at the time when machine learning experiment was performed." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Data for Modeling: (22800, 24)\n", - "Unseen Data For Predictions (1200, 24)\n" - ] - } - ], - "source": [ - "data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)\n", - "data_unseen = dataset.drop(data.index).reset_index(drop=True)\n", - "\n", - "print('Data for Modeling: ' + str(data.shape))\n", - "print('Unseen Data For Predictions ' + str(data_unseen.shape))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 6.0 Setting up Environment in PyCaret" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In previous tutorial __[Binary Classification (#CLF101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__ we have learned how to initializes the environment in pycaret using `setup()`. 
You would remember that we have not passed any additional parameters in our last example as we didn't performed any pre-processing (other than those that are imperative for machine learning experiments and they are performed automatically by PyCaret). In this example we will take it to the next level by customizing the pre-processing pipeline using `setup()`. Let's see how to implement all the steps discussed in section 4 above." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "from pycaret.classification import *" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "scrolled": false - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " \n", - "Setup Succesfully Completed!\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", 
- " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
    Description                     Value
0   session_id                      123
1   Target Type                     Binary
2   Label Encoded                   None
3   Original Data                   (22800, 24)
4   Missing Values                  False
5   Numeric Features                14
6   Categorical Features            9
7   Ordinal Features                False
8   High Cardinality Features       False
9   High Cardinality Method         None
10  Sampled Data                    (22800, 24)
11  Transformed Train Set           (15959, 125)
12  Transformed Test Set            (6841, 125)
13  Numeric Imputer                 mean
14  Categorical Imputer             constant
15  Normalize                       True
16  Normalize Method                zscore
17  Transformation                  True
18  Transformation Method           yeo-johnson
19  PCA                             False
20  PCA Method                      None
21  PCA Components                  None
22  Ignore Low Variance             True
23  Combine Rare Levels             False
24  Rare Level Threshold            None
25  Numeric Binning                 True
26  Remove Outliers                 False
27  Outliers Threshold              None
28  Remove Multicollinearity        True
29  Multicollinearity Threshold     0.95
30  Clustering                      False
31  Clustering Iteration            None
32  Polynomial Features             False
33  Polynomial Degree               None
34  Trignometry Features            False
35  Polynomial Threshold            None
36  Group Features                  True
37  Feature Selection               False
38  Features Selection Threshold    None
39  Feature Interaction             False
40  Feature Ratio                   False
41  Interaction Threshold           None
" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "exp_clf102 = setup(data = data, target = 'default', session_id=123,\n", - " normalize = True, \n", - " transformation = True, \n", - " ignore_low_variance = True,\n", - " remove_multicollinearity = True, multicollinearity_threshold = 0.95,\n", - " bin_numeric_features = ['LIMIT_BAL', 'AGE'],\n", - " group_features = [['BILL_AMT1', 'BILL_AMT2','BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'],\n", - " ['PAY_AMT1','PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']]) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that it's the same setup grid that was shown in __[Binary Classification Tutorial (#CLF101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__. The only difference here is the customization that we have performed in `setup()` is now set to `True`. Also notice that `session_id` is same `123` as beginner's level, which means effect of randomization is completely isolated. Any improvements we see in this experiment is solely due to the pre-processing steps taken in `setup()` or any other modeling techniques we use in later sections of this tutorial." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 7.0 Comparing All Models" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Similar to __[Binary Classification Tutorial (#CLF101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__ we will also begin this tutorial with `compare_models()`. We will then compare the below results with the last experiment." 
- ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
    Model                            Accuracy  AUC     Recall  Prec.   F1      Kappa
0   Gradient Boosting Classifier     0.8237    0.7874  0.368   0.6918  0.4801  0.3857
1   Ridge Classifier                 0.8236    0       0.3646  0.693   0.4776  0.3836
2   Extreme Gradient Boosting        0.8232    0.7853  0.3618  0.693   0.4751  0.3812
3   Linear Discriminant Analysis     0.8224    0.778   0.3796  0.6762  0.486   0.3887
4   SVM - Linear Kernel              0.822     0       0.3343  0.7083  0.4539  0.3635
5   Light Gradient Boosting Machine  0.822     0.7829  0.3819  0.6724  0.4869  0.3889
6   Logistic Regression              0.8217    0.7804  0.3592  0.6856  0.4712  0.3763
7   CatBoost Classifier              0.821     0.7859  0.3796  0.669   0.4842  0.3857
8   Ada Boost Classifier             0.8191    0.7822  0.338   0.6852  0.4524  0.3587
9   Extra Trees Classifier           0.817     0.7634  0.3824  0.6467  0.4803  0.378
10  Random Forest Classifier         0.8091    0.735   0.3326  0.6302  0.4351  0.3332
11  Naive Bayes                      0.8044    0.753   0.2499  0.6587  0.3579  0.2701
12  K Neighbors Classifier           0.7896    0.7005  0.328   0.5405  0.4081  0.2894
13  Decision Tree Classifier         0.7312    0.6225  0.4258  0.3994  0.412   0.2381
14  Quadratic Discriminant Analysis  0.7294    0.7419  0.2649  0.5675  0.2415  0.1379
" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "compare_models()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For the purpose of comparison we will use `AUC` this time. Notice that how drastically few algorithms have improved after we have performed few pre-processing steps in `setup()`. \n", - "- Logistic Regression AUC improved from `0.6508` to `0.7804`\n", - "- Naives Bayes AUC improved from `0.6457` to `0.7530`\n", - "- K Nearest Neighbors AUC improved from `0.6099` to `0.7005`\n", - "\n", - "To see results for all the models from previous tutorial refer to Section 7 in __[Binary Classification Tutorial (#CLF101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 8.0 Create a Model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In previous tutorial __[Binary Classification (#CLF101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__ we have learned how to create a model using `create_model()` function. Now we will learn about few other parameters in `create_model()` that may come handy sometimes. In this section of the tutorial, we will create all models using 5 fold stratified cross validation, notice how `fold` parameter is passed inside `create_model()` to achieve this." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 8.1 Create Model (change fold to 5)" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
      Accuracy     AUC  Recall   Prec.      F1   Kappa
0       0.7278  0.6195  0.4193  0.3921  0.4052  0.2289
1       0.7334  0.6230  0.4221  0.4022  0.4119  0.2396
2       0.7346  0.6231  0.4221  0.4043  0.4130  0.2417
3       0.7350  0.6143  0.3980  0.4003  0.3991  0.2291
4       0.7258  0.6234  0.4391  0.3929  0.4147  0.2364
Mean    0.7313  0.6207  0.4201  0.3983  0.4088  0.2352
SD      0.0038  0.0035  0.0131  0.0050  0.0058  0.0053
\n", - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.7278 0.6195 0.4193 0.3921 0.4052 0.2289\n", - "1 0.7334 0.6230 0.4221 0.4022 0.4119 0.2396\n", - "2 0.7346 0.6231 0.4221 0.4043 0.4130 0.2417\n", - "3 0.7350 0.6143 0.3980 0.4003 0.3991 0.2291\n", - "4 0.7258 0.6234 0.4391 0.3929 0.4147 0.2364\n", - "Mean 0.7313 0.6207 0.4201 0.3983 0.4088 0.2352\n", - "SD 0.0038 0.0035 0.0131 0.0050 0.0058 0.0053" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "dt = create_model('dt', fold = 5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 8.2 Create Model (round to 2 decimals points)" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
      Accuracy   AUC  Recall  Prec.    F1  Kappa
0         0.73  0.62    0.42   0.39  0.41   0.23
1         0.73  0.62    0.42   0.40  0.41   0.24
2         0.73  0.62    0.42   0.40  0.41   0.24
3         0.73  0.61    0.40   0.40  0.40   0.23
4         0.73  0.62    0.44   0.39  0.41   0.24
Mean      0.73  0.62    0.42   0.40  0.41   0.24
SD        0.00  0.00    0.01   0.00  0.01   0.01
\n", - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.73 0.62 0.42 0.39 0.41 0.23\n", - "1 0.73 0.62 0.42 0.40 0.41 0.24\n", - "2 0.73 0.62 0.42 0.40 0.41 0.24\n", - "3 0.73 0.61 0.40 0.40 0.40 0.23\n", - "4 0.73 0.62 0.44 0.39 0.41 0.24\n", - "Mean 0.73 0.62 0.42 0.40 0.41 0.24\n", - "SD 0.00 0.00 0.01 0.00 0.01 0.01" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "dt = create_model('dt', fold = 5, round = 2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice how passing `round` parameter inside `create_model()` has rounded the evaluation metrics to 2 decimals. The score grid printed in 8.1 and 8.2 is exactly similar except that scores in 8.2 is rounded." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 9.0 Tune a Model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In previous tutorial __[Binary Classification (#CLF101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__ we have learned how to automatically tune hyperparameters of a model using pre-defined grids. In this tutorial we will introduce the use of `optimize` parameter in `tune_model()`. Optimize parameter can be thought of as an objective function. By default, in `pycaret.classification` all hyperparameter tuning is set to optimize `Accuracy` which can be changed using `optimize` parameter. See the example below:" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
      Accuracy     AUC  Recall   Prec.      F1   Kappa
0       0.8315  0.7803  0.3853  0.7234  0.5028  0.4125
1       0.8208  0.7904  0.3484  0.6872  0.4624  0.3684
2       0.8208  0.8165  0.3938  0.6588  0.4929  0.3923
3       0.8233  0.7691  0.3569  0.6961  0.4719  0.3788
4       0.8158  0.7721  0.3484  0.6578  0.4556  0.3571
5       0.8127  0.7825  0.3314  0.6500  0.4390  0.3405
6       0.8133  0.7639  0.3484  0.6440  0.4522  0.3515
7       0.8340  0.7871  0.3711  0.7529  0.4972  0.4112
8       0.8214  0.7657  0.3569  0.6848  0.4693  0.3745
9       0.8107  0.7839  0.3513  0.6294  0.4509  0.3475
Mean    0.8204  0.7811  0.3592  0.6784  0.4694  0.3734
SD      0.0074  0.0146  0.0179  0.0363  0.0206  0.0242
\n", - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8315 0.7803 0.3853 0.7234 0.5028 0.4125\n", - "1 0.8208 0.7904 0.3484 0.6872 0.4624 0.3684\n", - "2 0.8208 0.8165 0.3938 0.6588 0.4929 0.3923\n", - "3 0.8233 0.7691 0.3569 0.6961 0.4719 0.3788\n", - "4 0.8158 0.7721 0.3484 0.6578 0.4556 0.3571\n", - "5 0.8127 0.7825 0.3314 0.6500 0.4390 0.3405\n", - "6 0.8133 0.7639 0.3484 0.6440 0.4522 0.3515\n", - "7 0.8340 0.7871 0.3711 0.7529 0.4972 0.4112\n", - "8 0.8214 0.7657 0.3569 0.6848 0.4693 0.3745\n", - "9 0.8107 0.7839 0.3513 0.6294 0.4509 0.3475\n", - "Mean 0.8204 0.7811 0.3592 0.6784 0.4694 0.3734\n", - "SD 0.0074 0.0146 0.0179 0.0363 0.0206 0.0242" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "tuned_rf = tune_model('rf')" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8271 0.7825 0.3541 0.7225 0.4753 0.3859\n", - "1 0.8227 0.7973 0.3399 0.7059 0.4589 0.3680\n", - "2 0.8227 0.8200 0.3739 0.6804 0.4826 0.3864\n", - "3 0.8202 0.7692 0.3399 0.6897 0.4554 0.3623\n", - "4 0.8208 0.7755 0.3371 0.6959 0.4542 0.3621\n", - "5 0.8170 0.7892 0.3343 0.6743 0.4470 0.3520\n", - "6 0.8152 0.7548 0.3286 0.6667 0.4402 0.3445\n", - "7 0.8283 0.7893 0.3428 0.7423 0.4690 0.3827\n", - "8 0.8208 0.7601 0.3484 0.6872 0.4624 0.3684\n", - "9 0.8125 0.7808 0.3513 0.6392 0.4534 0.3516\n", - "Mean 0.8207 0.7819 0.3450 0.6904 0.4598 0.3664\n", - "SD 0.0047 0.0179 0.0121 0.0275 0.0122 0.0141" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "tuned_rf2 = tune_model('rf', optimize = 'AUC')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that how the `optimize` parameter used in tuning Random Forest Classifier resulted in two different models. In `tuned_rf` where no optimize parameter was defined, it uses default function `Accuracy` to optimize and resulted in AUC of `0.7811`. In `tuned_rf2` we set `optimize` parameter to `AUC`, it resulted in AUC of `0.7819`. Notice the differences between hyperparameters of `tuned_rf` and `tuned_rf2` below:" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Parameters\n", - "bootstrap True\n", - "ccp_alpha 0\n", - "class_weight None\n", - "criterion entropy\n", - "max_depth 20\n", - "max_features sqrt\n", - "max_leaf_nodes None\n", - "max_samples None\n", - "min_impurity_decrease 0\n", - "min_impurity_split None\n", - "min_samples_leaf 4\n", - "min_samples_split 5\n", - "min_weight_fraction_leaf 0\n", - "n_estimators 50\n", - "n_jobs None\n", - "oob_score False\n", - "random_state 123\n", - "verbose 0\n", - "warm_start False" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "#tuned_rf optimize parameter default to 'Accuracy'\n", - "plot_model(tuned_rf, plot = 'parameter')" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Parameters\n", - "bootstrap True\n", - "ccp_alpha 0\n", - "class_weight None\n", - "criterion gini\n", - "max_depth 10\n", - "max_features auto\n", - "max_leaf_nodes None\n", - "max_samples None\n", - "min_impurity_decrease 0\n", - "min_impurity_split None\n", - "min_samples_leaf 2\n", - "min_samples_split 10\n", - "min_weight_fraction_leaf 0\n", - "n_estimators 70\n", - "n_jobs None\n", - "oob_score False\n", - "random_state 123\n", - "verbose 0\n", - "warm_start False" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "#tuned_rf optimize parameter set to 'AUC'\n", - "plot_model(tuned_rf2, plot = 'parameter')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "___" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 10.0 Ensemble a Model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Another common technique to improve performance of models is ensembling. Ensemble models in machine learning combine the decisions from multiple models to improve the overall performance. There are various techniques of ensembling that we will cover in this section. In order to ensemble models using `Bagging` or `Boosting` technique __[(Read More)](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205)__, we use `ensemble_model()` function in PyCaret. This function ensembles the trained base estimator using the method defined in `method` parameter." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.7237 0.6136 0.4164 0.3848 0.4000 0.2209\n", - "1 0.7331 0.6260 0.4249 0.4021 0.4132 0.2406\n", - "2 0.7318 0.6277 0.4391 0.4026 0.4201 0.2461\n", - "3 0.7262 0.6193 0.4249 0.3906 0.4071 0.2295\n", - "4 0.7299 0.6050 0.3796 0.3873 0.3834 0.2105\n", - "5 0.7456 0.6427 0.4589 0.4297 0.4438 0.2792\n", - "6 0.7206 0.6201 0.4419 0.3852 0.4116 0.2295\n", - "7 0.7425 0.6319 0.4334 0.4203 0.4268 0.2608\n", - "8 0.7375 0.6202 0.4079 0.4068 0.4074 0.2387\n", - "9 0.7216 0.6184 0.4306 0.3848 0.4064 0.2253\n", - "Mean 0.7312 0.6225 0.4258 0.3994 0.4120 0.2381\n", - "SD 0.0082 0.0098 0.0204 0.0151 0.0153 0.0191" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# lets create a simple decision tree model that we will use for ensembling \n", - "dt = create_model('dt')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 10.1 Bagging" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8114 0.7423 0.3683 0.6250 0.4635 0.3582\n", - "1 0.8064 0.7318 0.3371 0.6134 0.4351 0.3300\n", - "2 0.8183 0.7599 0.4023 0.6425 0.4948 0.3911\n", - "3 0.8001 0.7181 0.3258 0.5867 0.4189 0.3100\n", - "4 0.8133 0.7249 0.3428 0.6471 0.4481 0.3483\n", - "5 0.8133 0.7367 0.3541 0.6410 0.4562 0.3546\n", - "6 0.7964 0.7171 0.3088 0.5737 0.4015 0.2919\n", - "7 0.8170 0.7497 0.3513 0.6631 0.4593 0.3614\n", - "8 0.8139 0.7240 0.3626 0.6400 0.4629 0.3606\n", - "9 0.8100 0.7450 0.3683 0.6190 0.4618 0.3554\n", - "Mean 0.8100 0.7349 0.3521 0.6252 0.4502 0.3461\n", - "SD 0.0067 0.0135 0.0246 0.0264 0.0248 0.0270" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "bagged_dt = ensemble_model(dt)" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,\n", - " class_weight=None,\n", - " criterion='gini',\n", - " max_depth=None,\n", - " max_features=None,\n", - " max_leaf_nodes=None,\n", - " min_impurity_decrease=0.0,\n", - " min_impurity_split=None,\n", - " min_samples_leaf=1,\n", - " min_samples_split=2,\n", - " min_weight_fraction_leaf=0.0,\n", - " presort='deprecated',\n", - " random_state=123,\n", - " splitter='best'),\n", - " bootstrap=True, bootstrap_features=False, max_features=1.0,\n", - " max_samples=1.0, n_estimators=10, n_jobs=None,\n", - " oob_score=False, random_state=123, verbose=0,\n", - " warm_start=False)\n" - ] - } - ], - "source": [ - "# check the parameter of bagged_dt\n", - "print(bagged_dt)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice how ensembling has improved the `AUC` from `0.6225` to `0.7349`. In above example we have used all default parameters of `ensemble_model()` which uses `Bagging` method. 
Let's try `Boosting` by changing the `method` parameter in `ensemble_model()`. See the example below:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 10.2 Boosting" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.7838 0.7071 0.3909 0.5149 0.4444 0.3134\n", - "1 0.7782 0.6808 0.3286 0.4979 0.3959 0.2670\n", - "2 0.7782 0.7186 0.3994 0.4982 0.4434 0.3070\n", - "3 0.7826 0.6889 0.3768 0.5115 0.4339 0.3032\n", - "4 0.7782 0.6521 0.2861 0.4975 0.3633 0.2407\n", - "5 0.7832 0.6959 0.3966 0.5128 0.4473 0.3152\n", - "6 0.7801 0.6881 0.3626 0.5039 0.4217 0.2904\n", - "7 0.7845 0.6634 0.3598 0.5184 0.4247 0.2974\n", - "8 0.7726 0.6889 0.4023 0.4830 0.4389 0.2978\n", - "9 0.7762 0.7112 0.3626 0.4923 0.4176 0.2830\n", - "Mean 0.7797 0.6895 0.3666 0.5031 0.4231 0.2915\n", - "SD 0.0036 0.0196 0.0346 0.0107 0.0248 0.0218" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "boosted_dt = ensemble_model(dt, method = 'Boosting')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice how easy it is to ensemble models in PyCaret. Just by changing `method` parameter you can do Bagging or Boosting which otherwise would have taken multiple lines of code. Note that `ensemble_model()` by default build `10` estimators, however, this can be changed using `n_estimators` parameter inside `ensemble_model()`. Increase `n_estimators` sometimes improves the result. See an example below:" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8214 0.7588 0.3966 0.6604 0.4956 0.3952\n", - "1 0.8208 0.7634 0.3683 0.6736 0.4762 0.3791\n", - "2 0.8246 0.7859 0.4164 0.6652 0.5122 0.4121\n", - "3 0.8108 0.7401 0.3626 0.6244 0.4588 0.3538\n", - "4 0.8145 0.7575 0.3626 0.6432 0.4638 0.3620\n", - "5 0.8264 0.7689 0.3938 0.6881 0.5009 0.4051\n", - "6 0.8133 0.7348 0.3768 0.6303 0.4716 0.3669\n", - "7 0.8258 0.7673 0.3881 0.6884 0.4964 0.4008\n", - "8 0.8170 0.7477 0.3739 0.6502 0.4748 0.3737\n", - "9 0.8176 0.7611 0.4108 0.6360 0.4991 0.3939\n", - "Mean 0.8192 0.7585 0.3850 0.6560 0.4849 0.3842\n", - "SD 0.0052 0.0141 0.0183 0.0218 0.0171 0.0189" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "bagged_dt2 = ensemble_model(dt, n_estimators=50)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice how increasing the n_estimators parameter has improved the result. AUC in bagged_dt2 where `n_estimators = 50` improved to `0.7585` from `0.7349` in bagged_dt where n_estimators uses default value of `10`. You can also use `tune_model()` function to automatically tune `n_estimators` parameter of ensemble. See example below in which we are tuning Decision Tree along with `Bagging` parameters of Ensembled Decision Tree." - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8233 0.7783 0.3371 0.7126 0.4577 0.3679\n", - "1 0.8227 0.7927 0.3484 0.6989 0.4650 0.3727\n", - "2 0.8195 0.8145 0.3541 0.6757 0.4647 0.3686\n", - "3 0.8214 0.7664 0.3541 0.6868 0.4673 0.3729\n", - "4 0.8264 0.7761 0.3569 0.7159 0.4764 0.3860\n", - "5 0.8252 0.7894 0.3513 0.7126 0.4706 0.3800\n", - "6 0.8145 0.7501 0.3201 0.6686 0.4330 0.3382\n", - "7 0.8289 0.7838 0.3343 0.7564 0.4637 0.3795\n", - "8 0.8183 0.7569 0.3314 0.6842 0.4466 0.3532\n", - "9 0.8144 0.7741 0.3484 0.6508 0.4539 0.3542\n", - "Mean 0.8215 0.7782 0.3436 0.6963 0.4599 0.3673\n", - "SD 0.0046 0.0176 0.0115 0.0283 0.0120 0.0139" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "tuned_bagged_dt = tune_model('dt', ensemble = True, method='Bagging')" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,\n", - " class_weight=None,\n", - " criterion='gini',\n", - " max_depth=6,\n", - " max_features=None,\n", - " max_leaf_nodes=None,\n", - " min_impurity_decrease=0.0,\n", - " min_impurity_split=None,\n", - " min_samples_leaf=2,\n", - " min_samples_split=2,\n", - " min_weight_fraction_leaf=0.0,\n", - " presort='deprecated',\n", - " random_state=123,\n", - " splitter='best'),\n", - " bootstrap=False, bootstrap_features=True, max_features=1.0,\n", - " max_samples=1.0, n_estimators=80, n_jobs=None,\n", - " oob_score=False, random_state=123, verbose=0,\n", - " warm_start=False)\n" - ] - } - ], - "source": [ - "# check the parameters of tuned Decision Tree with bagging\n", - "print(tuned_bagged_dt)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that `tuned_bagged_dt` is a Decision Tree wrapped inside a `BaggingClassifier`. 
Our first bagging ensemble, stored in `bagged_dt` and built with default values, resulted in a mean AUC of `0.7349`, which improved to `0.7585` when we increased the `n_estimators` parameter to `50`. When we tuned the Decision Tree with `ensemble = True` inside the `tune_model()` function, the mean AUC of the tuned, ensembled Decision Tree reached `0.7782`." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 10.3 Blending" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Blending is another common ensembling technique in PyCaret. It combines the predictions of multiple models into a final prediction by voting across all the models passed in the `estimator_list` parameter. If no list is passed, PyCaret by default uses all the models available in the model library. The `method` parameter defines the type of voting. When `method` is set to `hard`, the predicted labels are used for majority-rule voting. When `method` is set to `soft`, the sum of predicted probabilities is used instead of the labels. Let's see an example below:" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8252 0.0 0.3683 0.6989 0.4824 0.3891\n", - "1 0.8246 0.0 0.3456 0.7135 0.4656 0.3755\n", - "2 0.8233 0.0 0.3796 0.6802 0.4873 0.3907\n", - "3 0.8177 0.0 0.3484 0.6685 0.4581 0.3613\n", - "4 0.8264 0.0 0.3484 0.7235 0.4704 0.3814\n", - "5 0.8239 0.0 0.3569 0.7000 0.4728 0.3802\n", - "6 0.8183 0.0 0.3399 0.6780 0.4528 0.3580\n", - "7 0.8271 0.0 0.3484 0.7278 0.4713 0.3829\n", - "8 0.8214 0.0 0.3484 0.6910 0.4633 0.3698\n", - "9 0.8176 0.0 0.3654 0.6582 0.4699 0.3705\n", - "Mean 0.8225 0.0 0.3550 0.6940 0.4694 0.3759\n", - "SD 0.0034 0.0 0.0118 0.0220 0.0098 0.0105" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "blend_hard = blend_models()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that we have created a Voting Classifier using `blend_models()` function. The model stored in variable `blend_hard` is just like any other model that you would create using `create_model()` or `tune_model()`. You can use this model for prediction on unseen data using `predict_model()` in the same way you would use any other model. Notice that since we didn't pass the list of specific models for voting, it by default uses all the models in model library. \n", - "\n", - "You may have noticed that `AUC` is zero for all folds. This is because `method` parameter is set to `hard` which only uses labels (1 or 0) for predictions and hence no AUC is calculated. To change this you can use `method` parameter inside `blend_models()`. See example below:" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8227 0.7735 0.4108 0.6591 0.5061 0.4051\n", - "1 0.8221 0.7894 0.3229 0.7170 0.4453 0.3570\n", - "2 0.8227 0.8034 0.4363 0.6471 0.5212 0.4174\n", - "3 0.8183 0.7643 0.3229 0.6909 0.4402 0.3483\n", - "4 0.8233 0.7652 0.3144 0.7351 0.4405 0.3550\n", - "5 0.8258 0.7893 0.3456 0.7219 0.4674 0.3784\n", - "6 0.8177 0.7607 0.3229 0.6867 0.4393 0.3469\n", - "7 0.8258 0.7900 0.3541 0.7143 0.4735 0.3830\n", - "8 0.8208 0.7609 0.3314 0.7006 0.4500 0.3589\n", - "9 0.8113 0.7657 0.3428 0.6368 0.4457 0.3441\n", - "Mean 0.8210 0.7763 0.3504 0.6910 0.4629 0.3694\n", - "SD 0.0041 0.0146 0.0388 0.0318 0.0278 0.0243" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "blend_soft = blend_models(method = 'soft')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that results are not much different. Both the examples above uses all the models in model library to create Voting Classifier. Now we will create specific models using `create_model()` and use them inside `blend_models()` to create Voting Classifier of hand picked models passed in `estimator_list` parameter. See example below:" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [], - "source": [ - "\"\"\"\n", - "we will create 4 specific models to be passed into blend_models().\n", - "Note that verbose is set to False to avoid printing score grid of individual models.\n", - "\"\"\"\n", - "\n", - "gbc = create_model('gbc', verbose = False)\n", - "dt = create_model('dt', verbose = False)\n", - "lightgbm = create_model('lightgbm', verbose = False)\n", - "xgboost = create_model('xgboost', verbose = False)" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8127 0.7664 0.3853 0.6239 0.4764 0.3700\n", - "1 0.8145 0.7793 0.3768 0.6364 0.4733 0.3696\n", - "2 0.8145 0.8050 0.4108 0.6223 0.4949 0.3871\n", - "3 0.8127 0.7565 0.3711 0.6298 0.4670 0.3625\n", - "4 0.8189 0.7615 0.3598 0.6684 0.4678 0.3703\n", - "5 0.8170 0.7838 0.3853 0.6445 0.4823 0.3796\n", - "6 0.8139 0.7537 0.3938 0.6261 0.4835 0.3771\n", - "7 0.8289 0.7808 0.3853 0.7083 0.4991 0.4066\n", - "8 0.8158 0.7483 0.3768 0.6425 0.4750 0.3724\n", - "9 0.8094 0.7636 0.3853 0.6099 0.4722 0.3631\n", - "Mean 0.8158 0.7699 0.3830 0.6412 0.4791 0.3758\n", - "SD 0.0050 0.0163 0.0129 0.0270 0.0103 0.0124" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "blend_specific_soft = blend_models(estimator_list = [gbc,dt,lightgbm,xgboost], method = 'soft')" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8308 0.0 0.3683 0.7345 0.4906 0.4023\n", - "1 0.8252 0.0 0.3484 0.7151 0.4686 0.3785\n", - "2 0.8214 0.0 0.3626 0.6809 0.4732 0.3775\n", - "3 0.8221 0.0 0.3541 0.6906 0.4682 0.3744\n", - "4 0.8271 0.0 0.3513 0.7251 0.4733 0.3844\n", - "5 0.8252 0.0 0.3626 0.7033 0.4785 0.3861\n", - "6 0.8221 0.0 0.3513 0.6927 0.4662 0.3728\n", - "7 0.8315 0.0 0.3541 0.7530 0.4817 0.3963\n", - "8 0.8183 0.0 0.3286 0.6864 0.4444 0.3516\n", - "9 0.8169 0.0 0.3683 0.6533 0.4710 0.3706\n", - "Mean 0.8240 0.0 0.3550 0.7035 0.4716 0.3794\n", - "SD 0.0046 0.0 0.0111 0.0276 0.0114 0.0134" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "blend_specific_hard = blend_models(estimator_list = [gbc,dt,lightgbm,xgboost], method = 'hard')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that Accuracy has improved to `0.8240` in `blend_specific_hard` (`hard` method) from `0.8158` in `blend_specific_soft` (`soft` method). The `AUC` of `0.0` for the `hard` method is expected: hard voting produces class labels only, so there are no predicted probabilities from which to compute AUC. Which `method` and `models` to use in blending depends on the statistical properties of the dataset; experimenting with different models and methods is the best way to find out which blender works best on a specific problem." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 10.4 Stacking" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Stacking is another popular ensembling technique, but it is less commonly implemented due to practical difficulties. It is an ensemble learning technique that combines multiple models via a meta-model. Another way to think about stacking is that multiple models are trained to predict the outcome, and a meta-model is then trained that uses the predictions from all those models as input, along with the original features. The implementation of `stack_models()` is based on Wolpert, D. H. (1992b). 
Stacked generalization __[(Read More)](https://www.sciencedirect.com/science/article/abs/pii/S0893608005800231)__. \n", - "\n", - "Let's see an example below using the models we have created in section 10.3 above." - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8252 0.7835 0.3768 0.6927 0.4881 0.3936\n", - "1 0.8221 0.7944 0.3598 0.6865 0.4721 0.3774\n", - "2 0.8239 0.8224 0.3909 0.6765 0.4955 0.3980\n", - "3 0.8227 0.7849 0.3768 0.6786 0.4845 0.3878\n", - "4 0.8246 0.7803 0.3711 0.6931 0.4834 0.3892\n", - "5 0.8177 0.7973 0.3626 0.6598 0.4680 0.3690\n", - "6 0.8189 0.7525 0.3626 0.6667 0.4697 0.3718\n", - "7 0.8302 0.7892 0.3768 0.7228 0.4953 0.4052\n", - "8 0.8195 0.7527 0.3513 0.6776 0.4627 0.3671\n", - "9 0.8113 0.7780 0.3739 0.6226 0.4673 0.3612\n", - "Mean 0.8216 0.7835 0.3703 0.6777 0.4787 0.3820\n", - "SD 0.0049 0.0195 0.0108 0.0246 0.0115 0.0140" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "stack_soft = stack_models([gbc,dt,lightgbm,xgboost])" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8246 0.7793 0.3654 0.6973 0.4796 0.3862\n", - "1 0.8214 0.7892 0.3456 0.6932 0.4612 0.3683\n", - "2 0.8264 0.8169 0.3881 0.6919 0.4973 0.4023\n", - "3 0.8189 0.7855 0.3598 0.6684 0.4678 0.3703\n", - "4 0.8252 0.7757 0.3626 0.7033 0.4785 0.3861\n", - "5 0.8183 0.7980 0.3541 0.6684 0.4630 0.3658\n", - "6 0.8208 0.7463 0.3456 0.6893 0.4604 0.3668\n", - "7 0.8302 0.7867 0.3683 0.7303 0.4896 0.4008\n", - "8 0.8246 0.7521 0.3513 0.7086 0.4697 0.3786\n", - "9 0.8119 0.7719 0.3598 0.6318 0.4585 0.3549\n", - "Mean 0.8222 0.7801 0.3601 0.6883 0.4726 0.3780\n", - "SD 0.0049 0.0196 0.0119 0.0255 0.0126 0.0148" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "stack_hard = stack_models([gbc,dt,lightgbm,xgboost], method='hard')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Similar to blending, `stack_models()` also supports the soft and hard methods, which can be set with the `method` parameter. The soft method passes the predicted probabilities of the base models to the meta model, and the hard method passes the predicted labels (1 or 0). In both examples above, the meta model (the final model that generates predictions) is Logistic Regression, which is the default. The meta model can be changed using the `meta_model` parameter. See an example below:" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8302 0.7828 0.3824 0.7181 0.4991 0.4081\n", - "1 0.8258 0.7909 0.3513 0.7168 0.4715 0.3815\n", - "2 0.8177 0.8210 0.3654 0.6582 0.4699 0.3705\n", - "3 0.8195 0.7711 0.3598 0.6720 0.4686 0.3717\n", - "4 0.8246 0.7777 0.3598 0.7017 0.4757 0.3832\n", - "5 0.8195 0.7846 0.3654 0.6684 0.4725 0.3748\n", - "6 0.8183 0.7672 0.3626 0.6632 0.4689 0.3704\n", - "7 0.8377 0.7938 0.3881 0.7611 0.5141 0.4287\n", - "8 0.8170 0.7625 0.3569 0.6597 0.4632 0.3645\n", - "9 0.8157 0.7860 0.3768 0.6425 0.4750 0.3723\n", - "Mean 0.8226 0.7838 0.3669 0.6862 0.4778 0.3826\n", - "SD 0.0066 0.0157 0.0112 0.0350 0.0151 0.0192" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "stack_soft2 = stack_models([gbc,dt,lightgbm], meta_model=xgboost)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Which `method` and `models` to use in stacking depends on the statistical properties of the dataset; experimenting with different models and methods is the best way to find out which configuration works best. However, as a general rule of thumb, models with strong yet diverse performance tend to improve results when used in stacking. One way to measure diversity is the correlation between the predictions of the models. You can analyze that using the `plot` parameter. See an example below: " - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8252 0.7835 0.3768 0.6927 0.4881 0.3936\n", - "1 0.8221 0.7944 0.3598 0.6865 0.4721 0.3774\n", - "2 0.8239 0.8224 0.3909 0.6765 0.4955 0.3980\n", - "3 0.8227 0.7849 0.3768 0.6786 0.4845 0.3878\n", - "4 0.8246 0.7803 0.3711 0.6931 0.4834 0.3892\n", - "5 0.8177 0.7973 0.3626 0.6598 0.4680 0.3690\n", - "6 0.8189 0.7525 0.3626 0.6667 0.4697 0.3718\n", - "7 0.8302 0.7892 0.3768 0.7228 0.4953 0.4052\n", - "8 0.8195 0.7527 0.3513 0.6776 0.4627 0.3671\n", - "9 0.8113 0.7780 0.3739 0.6226 0.4673 0.3612\n", - "Mean 0.8216 0.7835 0.3703 0.6777 0.4787 0.3820\n", - "SD 0.0049 0.0195 0.0108 0.0246 0.0115 0.0140" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "stack_soft_plot = stack_models([gbc,dt,lightgbm,xgboost], plot=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Before we wrap up this section, there is one more parameter in `stack_models()` that we haven't seen yet: `restack`, which is set to True by default. When `restack` is True, the meta model receives the original features along with the predictions of the base models, as described at the start of this section. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 11.0 Model Calibration" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When performing classification, you often want not only to predict the class label (an outcome such as 0 or 1) but also to obtain the probability of the respective outcome. This probability gives you a measure of confidence in the prediction. Some models give poor estimates of the class probabilities, and some do not support probability prediction at all. Well-calibrated classifiers are probabilistic classifiers whose output probabilities can be directly interpreted as a confidence level. PyCaret allows you to calibrate the probabilities of a given model through the `calibrate_model()` function. See an example below:" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8202 0.7370 0.3541 0.6793 0.4655 0.3701\n", - "1 0.8114 0.7464 0.3116 0.6548 0.4223 0.3261\n", - "2 0.8102 0.7508 0.3598 0.6225 0.4560 0.3508\n", - "3 0.8039 0.7311 0.3088 0.6124 0.4105 0.3079\n", - "4 0.8089 0.7130 0.3314 0.6290 0.4341 0.3322\n", - "5 0.8120 0.7345 0.3229 0.6514 0.4318 0.3342\n", - "6 0.8033 0.7314 0.3314 0.6000 0.4270 0.3200\n", - "7 0.8033 0.7506 0.3229 0.6032 0.4207 0.3150\n", - "8 0.8114 0.7231 0.3371 0.6398 0.4416 0.3410\n", - "9 0.8063 0.7323 0.3456 0.6100 0.4412 0.3347\n", - "Mean 0.8091 0.7350 0.3326 0.6302 0.4351 0.3332\n", - "SD 0.0050 0.0113 0.0161 0.0245 0.0158 0.0171" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "rf = create_model('rf')" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "plot_model(rf, plot='calibration')" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8189 0.7708 0.2720 0.7500 0.3992 0.3190\n", - "1 0.8170 0.7766 0.2833 0.7194 0.4065 0.3217\n", - "2 0.8164 0.8062 0.2691 0.7308 0.3934 0.3114\n", - "3 0.8127 0.7448 0.2776 0.6901 0.3960 0.3082\n", - "4 0.8127 0.7580 0.2776 0.6901 0.3960 0.3082\n", - "5 0.8152 0.7830 0.2720 0.7164 0.3943 0.3103\n", - "6 0.8152 0.7495 0.2890 0.6986 0.4088 0.3209\n", - "7 0.8202 0.7882 0.2635 0.7750 0.3932 0.3165\n", - "8 0.8152 0.7590 0.2833 0.7042 0.4040 0.3174\n", - "9 0.8113 0.7701 0.2720 0.6857 0.3895 0.3017\n", - "Mean 0.8155 0.7706 0.2759 0.7160 0.3981 0.3135\n", - "SD 0.0027 0.0178 0.0073 0.0275 0.0061 0.0062" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "calibrated_rf = calibrate_model(rf)" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "plot_model(calibrated_rf, plot='calibration')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice how different the two plots above look: one is before calibration and one is after. A perfectly calibrated classifier follows the black dotted line in these plots. Not only is `calibrated_rf` better calibrated, but its `AUC` has also improved from `0.7350` to `0.7706`. By default, `calibrate_model()` uses the `sigmoid` method, which corresponds to Platt's approach. The other available method is `isotonic`, which is a non-parametric approach. See an example of calibration using the `isotonic` method below: " - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8202 0.7681 0.2946 0.7324 0.4202 0.3359\n", - "1 0.8195 0.7768 0.3003 0.7211 0.4240 0.3379\n", - "2 0.8177 0.8048 0.2861 0.7214 0.4097 0.3249\n", - "3 0.8139 0.7448 0.3003 0.6795 0.4165 0.3250\n", - "4 0.8170 0.7583 0.2918 0.7103 0.4137 0.3270\n", - "5 0.8170 0.7803 0.2918 0.7103 0.4137 0.3270\n", - "6 0.8152 0.7482 0.3031 0.6859 0.4204 0.3295\n", - "7 0.8239 0.7890 0.2890 0.7727 0.4206 0.3413\n", - "8 0.8158 0.7602 0.2890 0.7034 0.4096 0.3224\n", - "9 0.8113 0.7692 0.2890 0.6711 0.4040 0.3123\n", - "Mean 0.8172 0.7700 0.2935 0.7108 0.4152 0.3283\n", - "SD 0.0033 0.0175 0.0056 0.0278 0.0059 0.0080" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "calibrated_rf_isotonic = calibrate_model(rf, method = 'isotonic')" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "plot_model(calibrated_rf_isotonic, plot='calibration')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 11.0 Predict on test / hold-out Sample" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In section 10.4 above we discussed that stacking is a less commonly implemented ensembling technique due to practical difficulties. To understand this better, imagine that the model deployed in production is a stacking ensembler of 4 models plus a meta model (similar to `stack_soft` created in section 10.4 above). To generate a prediction on an unseen dataset, every data point has to be predicted by all 4 models of the stacking ensembler, and all these predictions then have to pass through the meta-model to generate a final prediction. As the size of your stacking ensembler increases, it becomes code-intensive and hard to maintain and use in production.\n", - "\n", - "In __[Binary Classification Tutorial (CLF101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__ we saw how to use a trained model to generate predictions on the test / hold-out or unseen dataset. In this example we will see that generating predictions with a stacking ensembler in PyCaret is no different. For the purpose of illustration, we will use `stack_soft` created in section 10.4 above for the remaining part of this tutorial." - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " Model Accuracy AUC Recall Prec. F1 Kappa\n", - "0 Stacking Classifier 0.8146 0.7566 0.341 0.6557 0.4487 0.3504" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "predict_model(stack_soft);" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Accuracy on the hold-out sample is **`0.8146`**, compared with the CV result of **`0.8216`** in section 10.4 above. However, note that `AUC` declines on the hold-out set (`0.7566`) relative to CV (`0.7835`). We will discuss the reasons and how to investigate this in our next tutorial, __[Binary Classification Tutorial (CLF103) - Level Expert](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF103)%20-%20Level%20Expert.ipynb)__. For now we will finish the remaining part of this tutorial using the stacking ensembler stored in the `stack_soft` variable." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 12.0 Finalize Model for Deployment" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In __[Binary Classification Tutorial (CLF101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__ we learned the purpose of `finalize_model()` and how to use it. In this tutorial we will finalize the stacking ensembler, which is no different from finalizing a single model."
- ] - }, - { - "cell_type": "code", - "execution_count": 36, - "metadata": {}, - "outputs": [], - "source": [ - "final_stack_soft = finalize_model(stack_soft)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 13.0 Predict on unseen data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will now use `final_stack_soft` to generate predictions on `data_unseen`, which was created at the beginning and contains 5% (1200 samples) of the original dataset that was never exposed to PyCaret (see section 5 for explanations)." - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4 \\\n", - "0 50000 2 2 1 48 0 0 0 0 \n", - "1 200000 2 1 1 40 2 2 2 2 \n", - "2 50000 2 3 1 44 1 2 3 2 \n", - "3 60000 2 2 1 31 2 2 -1 0 \n", - "4 120000 2 3 2 32 -1 0 0 0 \n", - "\n", - " PAY_5 ... BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 \\\n", - "0 0 ... 8011.0 2028.0 2453.0 2329.0 431.0 300.0 \n", - "1 2 ... 89112.0 4200.0 4100.0 3000.0 3400.0 3500.0 \n", - "2 4 ... 15798.0 2100.0 1000.0 2300.0 0.0 0.0 \n", - "3 0 ... 30384.0 1132.0 60994.0 1436.0 1047.0 1056.0 \n", - "4 0 ... 81354.0 2429.0 3120.0 3300.0 10000.0 3200.0 \n", - "\n", - " PAY_AMT6 default Label Score \n", - "0 500.0 0 0 0.1770 \n", - "1 0.0 1 1 0.7896 \n", - "2 0.0 1 1 0.6457 \n", - "3 1053.0 1 1 0.5704 \n", - "4 3200.0 0 0 0.1386 \n", - "\n", - "[5 rows x 26 columns]" - ] - }, - "execution_count": 37, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "unseen_predictions = predict_model(final_stack_soft, data=data_unseen)\n", - "unseen_predictions.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice the last two columns 'Label' and 'Score'. Label is the prediction and score is the probability of prediction. Notice that predicted results are concated to the original dataset while all the transformations including imputation of missing values (in this case None), categorical encoding, feature extraction etc. is performed under the hood and you dont have to manage the pipeline manually. All this is done automatically in PyCaret." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 14.0 Save the experiment" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In __[Binary Classification Tutorial (CLF101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__ we learned how to save and load a model. In this tutorial we will learn how to save the entire experiment, including all the outputs and models we have built. Saving an experiment is as simple as saving a model." - ] - }, - { - "cell_type": "code", - "execution_count": 38, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Experiment Succesfully Saved\n" - ] - } - ], - "source": [ - "save_experiment('Experiment_123 08Feb2020')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 15.0 Loading saved experiment" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To load a saved experiment at a future date or in a different environment, we use the `load_experiment()` function." 
- ] - }, - { - "cell_type": "code", - "execution_count": 39, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Object\n", - "0 Classification Setup Config\n", - "1 X_training Set\n", - "2 y_training Set\n", - "3 X_test Set\n", - "4 y_test Set\n", - "5 Transformation Pipeline\n", - "6 Compare Models Score Grid\n", - "7 Decision Tree\n", - "8 Decision Tree Score Grid\n", - "9 Decision Tree\n", - "10 Decision Tree Score Grid\n", - "11 Tuned RandomForestClassifier\n", - "12 Tuned RandomForestClassifier Score Grid\n", - "13 Tuned RandomForestClassifier\n", - "14 Tuned RandomForestClassifier Score Grid\n", - "15 Decision Tree\n", - "16 Decision Tree Score Grid\n", - "17 BaggingClassifier\n", - "18 BaggingClassifier Score Grid\n", - "19 AdaBoostClassifier\n", - "20 AdaBoostClassifier Score Grid\n", - "21 BaggingClassifier\n", - "22 BaggingClassifier Score Grid\n", - "23 Tuned BaggingClassifier\n", - "24 Tuned BaggingClassifier Score Grid\n", - "25 Voting Classifier\n", - "26 Voting Classifier Score Grid\n", - "27 Voting Classifier\n", - "28 Voting Classifier Score Grid\n", - "29 Gradient Boosting Classifier\n", - "30 Gradient Boosting Classifier Score Grid\n", - "31 Decision Tree\n", - "32 Decision Tree Score Grid\n", - "33 Light Gradient Boosting Machine\n", - "34 Light Gradient Boosting Machine Score Grid\n", - "35 Extreme Gradient Boosting\n", - "36 Extreme Gradient Boosting Score Grid\n", - "37 Voting Classifier\n", - "38 Voting Classifier Score Grid\n", - "39 Voting Classifier\n", - "40 Voting Classifier Score Grid\n", - "41 Stacking Classifier (Single Layer)\n", - "42 Stacking Classifier (Single Layer) Score Grid\n", - "43 Stacking Classifier (Single Layer)\n", - "44 Stacking Classifier (Single Layer) Score Grid\n", - "45 Stacking Classifier (Single Layer)\n", - "46 Stacking Classifier (Single Layer) Score Grid\n", - "47 Stacking Classifier (Single Layer)\n", - "48 Stacking Classifier (Single Layer) Score Grid\n", - "49 Random Forest Classifier\n", - "50 Random Forest Classifier Score Grid\n", - "51 CalibratedClassifierCV\n", - "52 
CalibratedClassifierCV Score Grid\n", - "53 CalibratedClassifierCV\n", - "54 CalibratedClassifierCV Score Grid\n", - "55 Stacking Classifier (Single Layer)\n", - "56 Stacking Classifier (Single Layer) Score Grid\n", - "57 Final [GradientBoostingClassifier" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "saved_experiment = load_experiment('Experiment_123 08Feb2020')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that `load_experiment()` has loaded the entire experiment and all the intermediate outputs into the variable `saved_experiment`. You can access specific items the same way you would access list elements in Python. In the example below we access our final stacking ensembler and store it in the `final_stack_soft_loaded` variable." - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [], - "source": [ - "final_stack_soft_loaded = saved_experiment[57]" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4 \\\n", - "0 50000 2 2 1 48 0 0 0 0 \n", - "1 200000 2 1 1 40 2 2 2 2 \n", - "2 50000 2 3 1 44 1 2 3 2 \n", - "3 60000 2 2 1 31 2 2 -1 0 \n", - "4 120000 2 3 2 32 -1 0 0 0 \n", - "\n", - " PAY_5 ... BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 \\\n", - "0 0 ... 8011.0 2028.0 2453.0 2329.0 431.0 300.0 \n", - "1 2 ... 89112.0 4200.0 4100.0 3000.0 3400.0 3500.0 \n", - "2 4 ... 15798.0 2100.0 1000.0 2300.0 0.0 0.0 \n", - "3 0 ... 30384.0 1132.0 60994.0 1436.0 1047.0 1056.0 \n", - "4 0 ... 81354.0 2429.0 3120.0 3300.0 10000.0 3200.0 \n", - "\n", - " PAY_AMT6 default Label Score \n", - "0 500.0 0 0 0.1770 \n", - "1 0.0 1 1 0.7896 \n", - "2 0.0 1 1 0.6457 \n", - "3 1053.0 1 1 0.5704 \n", - "4 3200.0 0 0 0.1386 \n", - "\n", - "[5 rows x 26 columns]" - ] - }, - "execution_count": 41, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "new_prediction = predict_model(final_stack_soft_loaded, data=data_unseen)\n", - "new_prediction.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that results of `unseen_predictions` and `new_prediction` are identical." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 16.0 Wrap-up / Next Steps?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We have covered a lot of new concepts in this tutorial. Most importantly, we have seen how to use Exploratory Data Analysis in customizing the pipeline in `setup()` that has improved the results considerably comparing to what we have seen earlier in __[Binary Classification Tutorial (CLF101) - Level Beginner](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF101)%20-%20Level%20Beginner.ipynb)__. 
We have also learned how to perform and tune ensembling in PyCaret.\n", - "\n", - "In this tutorial, we have covered many significant concepts and how to perform them using `pycaret.classification`. However, there are still a few more things to cover, such as defining and optimizing a custom cost function, interpreting more complex tree-based models using Shapley values, advanced ensembling techniques such as multiple layer stacknet, and more in pre-processing pipelines. We will cover all this in our next and final tutorial of the `pycaret.classification` series. \n", - "\n", - "See you at the next tutorial. Follow the link to __[Binary Classification Tutorial (CLF103) - Level Expert](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20(CLF103)%20-%20Level%20Expert.ipynb)__" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} -- GitLab