diff --git a/Tutorials/Binary Classification Tutorial Level Beginner - CLF101.ipynb b/Tutorials/Binary Classification Tutorial Level Beginner - CLF101.ipynb deleted file mode 100644 index 3189e5f57959ca1a0c82eb57953db408c2b1ca97..0000000000000000000000000000000000000000 --- a/Tutorials/Binary Classification Tutorial Level Beginner - CLF101.ipynb +++ /dev/null @@ -1,3300 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Binary Classification Tutorial (CLF101) - Level Beginner" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Date Updated: Feb 25, 2020**\n", - "\n", - "# 1.0 Objective of Tutorial\n", - "Welcome to the Binary Classification Tutorial (#CLF101). This tutorial assumes that you are new to PyCaret and looking to get started with Binary Classification using the `pycaret.classification` module.\n", - "\n", - "In this tutorial we will learn:\n", - "\n", - "\n", - "* **Getting Data:** How to import data from the PyCaret repository?\n", - "* **Setting up Environment:** How to set up an experiment in PyCaret to get started with building classification models?\n", - "* **Create Model:** How to create a model, perform stratified cross validation and evaluate classification metrics?\n", - "* **Tune Model:** How to automatically tune the hyper-parameters of a classification model?\n", - "* **Plot Model:** How to analyze model performance using various plots?\n", - "* **Finalize Model:** How to finalize the best model at the end of the experiment?\n", - "* **Predict Model:** How to make predictions on a new / unseen dataset? \n", - "* **Save / Load Model:** How to save / load a model for future use?\n", - "\n", - "Read Time : Approx. 30 Minutes\n", - "\n", - "\n", - "## 1.1 Installing PyCaret\n", - "The first step to get started with PyCaret is to install pycaret. Installing pycaret is easy and takes only a few minutes. 
Follow the instructions below:\n", - "\n", - "#### Installing PyCaret in Local Jupyter Notebook\n", - "`pip install pycaret`
\n", - "\n", - "#### Installing PyCaret on Google Colab or Azure Notebooks\n", - "`!pip install pycaret`\n", - "\n", - "\n", - "## 1.2 Pre-Requisites\n", - "- Python 3.x\n", - "- Latest version of pycaret\n", - "- Internet connection to load data from pycaret's repository\n", - "- Basic Knowledge of Binary Classification\n", - "\n", - "\n", - "## 1.3 For Google Colab users:\n", - "If you are running this notebook on Google Colab, the code below must be run at the top of the notebook to display interactive visuals.
\n", - "
\n", - "`from pycaret.utils import enable_colab`
\n", - "`enable_colab()`\n", - "\n", - "\n", - "## 1.4 See also:\n", - "- __[Binary Classification Tutorial (CLF102) - Intermediate Level](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20Level%20Intermediate%20-%20CLF102.ipynb)__\n", - "- __[Binary Classification Tutorial (CLF103) - Expert Level](https://github.com/pycaret/pycaret/blob/master/Tutorials/Binary%20Classification%20Tutorial%20Level%20Expert%20-%20CLF103.ipynb)__" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 2.0 What is Binary Classification?\n", - "Binary classification is a supervised machine learning technique where the goal is to predict categorical class labels that are discrete and unordered, such as Pass/Fail, Positive/Negative, Default/Not-Default etc. A few real-world use cases for classification are listed below:\n", - "\n", - "- Medical testing to determine if a patient has certain disease or not – the classification property is the presence of the disease.\n", - "- A \"pass or fail\" test method or quality control in factories, i.e. deciding if a specification has or has not been met – a go/no-go classification.\n", - "- Information retrieval, namely deciding whether a page or an article should be in the result set of a search or not – the classification property is the relevance of the article, or the usefulness to the user.\n", - "\n", - "__[Learn More about Binary Classification](https://medium.com/@categitau/in-one-of-my-previous-posts-i-introduced-machine-learning-and-talked-about-the-two-most-common-c1ac6e18df16)__" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 3.0 Overview of Classification Module in PyCaret\n", - "PyCaret's classification module (`pycaret.classification`) is a supervised machine learning module used for classifying elements into a binary group based on various techniques and algorithms. 
Some common use cases of classification problems include predicting customer default (Yes or No), predicting customer churn (customer will leave or stay), and disease detection (Positive or Negative).\n", - "\n", - "PyCaret's classification module can be used for Binary or Multi-class classification problems. It has over 18 algorithms and 14 plots to analyze the performance of the model. Be it hyper-parameter tuning, ensembling or advanced techniques like stacking, PyCaret's classification module has it all." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 4.0 Dataset for the Tutorial" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For this tutorial we will use a dataset from UCI called **Default of Credit Card Clients Dataset**. This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. There are 24,000 samples and 25 features. Short descriptions of each feature are as follows:\n", - "\n", - "- **ID:** ID of each client\n", - "- **LIMIT_BAL:** Amount of given credit in NT dollars (includes individual and family/supplementary credit)\n", - "- **SEX:** Gender (1=male, 2=female)\n", - "- **EDUCATION:** (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)\n", - "- **MARRIAGE:** Marital status (1=married, 2=single, 3=others)\n", - "- **AGE:** Age in years\n", - "- **PAY_0 to PAY_6:** Repayment status n months ago (PAY_0 = last month ... PAY_6 = 6 months ago). Labels: -1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above\n", - "- **BILL_AMT1 to BILL_AMT6:** Amount of bill statement n months ago (BILL_AMT1 = last month .. BILL_AMT6 = 6 months ago)\n", - "- **PAY_AMT1 to PAY_AMT6:** Amount of payment n months ago (PAY_AMT1 = last month .. PAY_AMT6 = 6 months ago)\n", - "- **default:** Default payment (1=yes, 0=no) `Target Column`\n", - "\n", - "#### Dataset Acknowledgement:\n", - "Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.\n", - "\n", - "The original dataset and data dictionary can be __[found here.](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)__ " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 5.0 Getting the Data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can download the data from the original source __[found here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)__ and load it using pandas __[(Learn How)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)__ or you can use PyCaret's data repository to load the data using the `get_data()` function (this requires an internet connection)." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4 \\\n", - "0 20000 2 2 1 24 2 2 -1 -1 \n", - "1 90000 2 2 2 34 0 0 0 0 \n", - "2 50000 2 2 1 37 0 0 0 0 \n", - "3 50000 1 2 1 57 -1 0 -1 0 \n", - "4 50000 1 1 2 37 0 0 0 0 \n", - "\n", - " PAY_5 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 \\\n", - "0 -2 ... 0.0 0.0 0.0 0.0 689.0 0.0 \n", - "1 0 ... 14331.0 14948.0 15549.0 1518.0 1500.0 1000.0 \n", - "2 0 ... 28314.0 28959.0 29547.0 2000.0 2019.0 1200.0 \n", - "3 0 ... 20940.0 19146.0 19131.0 2000.0 36681.0 10000.0 \n", - "4 0 ... 19394.0 19619.0 20024.0 2500.0 1815.0 657.0 \n", - "\n", - " PAY_AMT4 PAY_AMT5 PAY_AMT6 default \n", - "0 0.0 0.0 0.0 1 \n", - "1 1000.0 1000.0 5000.0 0 \n", - "2 1100.0 1069.0 1000.0 0 \n", - "3 9000.0 689.0 679.0 0 \n", - "4 1000.0 1000.0 800.0 0 \n", - "\n", - "[5 rows x 24 columns]" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from pycaret.datasets import get_data\n", - "dataset = get_data('credit')" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(24000, 24)" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "#check the shape of data\n", - "dataset.shape" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In order to demonstrate the `predict_model()` function on unseen data, a sample of 1200 records are taken out from original dataset to be used for predictions. This should not be confused with train/test split. This particular split is performed to simulate real life scenario. Another way to think about this is that these 1200 records are not available at the time when machine learning experiment was performed." 
- ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Data for Modeling: (22800, 24)\n", - "Unseen Data For Predictions: (1200, 24)\n" - ] - } - ], - "source": [ - "# sample first and keep the original index, so the held-out rows can be dropped by label\n", - "data = dataset.sample(frac=0.95, random_state=786)\n", - "data_unseen = dataset.drop(data.index).reset_index(drop=True)\n", - "data = data.reset_index(drop=True)\n", - "\n", - "print('Data for Modeling: ' + str(data.shape))\n", - "print('Unseen Data For Predictions: ' + str(data_unseen.shape))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 6.0 Setting up Environment in PyCaret" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `setup()` function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. `setup()` must be called before executing any other function in pycaret. It takes two mandatory parameters: a pandas dataframe and the name of the target column. All other parameters are optional and are used to customize the pre-processing pipeline (we will see them in later tutorials).\n", - "\n", - "When `setup()` is executed, PyCaret's inference algorithm automatically infers the data types for all features based on certain properties. Although the data type is inferred correctly most of the time, this is not always the case. Therefore, after `setup()` is executed, PyCaret displays a table containing the features and their inferred data types. At this stage, you can inspect the table and press `enter` to continue if all data types are correctly inferred, or type `quit` to end the experiment. Identifying data types correctly is of fundamental importance in PyCaret as it automatically performs a few pre-processing tasks which are imperative to perform any machine learning experiment. These pre-processing tasks are performed differently for each data type. 
As such, it is very important that data types are correctly configured.\n", - "\n", - "In later tutorials we will learn how to overwrite PyCaret's inferred data types using the `numeric_features` and `categorical_features` parameters in `setup()`." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "from pycaret.classification import *" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "scrolled": false - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " \n", - "Setup Succesfully Completed!\n" - ] - }, - { - "data": { - "text/html": [ - "
| # | Description | Value |
|---|---|---|
| 0 | session_id | 123 |
| 1 | Target Type | Binary |
| 2 | Label Encoded | None |
| 3 | Original Data | (22800, 24) |
| 4 | Missing Values | False |
| 5 | Numeric Features | 14 |
| 6 | Categorical Features | 9 |
| 7 | Ordinal Features | False |
| 8 | High Cardinality Features | False |
| 9 | High Cardinality Method | None |
| 10 | Sampled Data | (22800, 24) |
| 11 | Transformed Train Set | (15959, 90) |
| 12 | Transformed Test Set | (6841, 90) |
| 13 | Numeric Imputer | mean |
| 14 | Categorical Imputer | constant |
| 15 | Normalize | False |
| 16 | Normalize Method | None |
| 17 | Transformation | False |
| 18 | Transformation Method | None |
| 19 | PCA | False |
| 20 | PCA Method | None |
| 21 | PCA Components | None |
| 22 | Ignore Low Variance | False |
| 23 | Combine Rare Levels | False |
| 24 | Rare Level Threshold | None |
| 25 | Numeric Binning | False |
| 26 | Remove Outliers | False |
| 27 | Outliers Threshold | None |
| 28 | Remove Multicollinearity | False |
| 29 | Multicollinearity Threshold | None |
| 30 | Clustering | False |
| 31 | Clustering Iteration | None |
| 32 | Polynomial Features | False |
| 33 | Polynomial Degree | None |
| 34 | Trignometry Features | False |
| 35 | Polynomial Threshold | None |
| 36 | Group Features | False |
| 37 | Feature Selection | False |
| 38 | Features Selection Threshold | None |
| 39 | Feature Interaction | False |
| 40 | Feature Ratio | False |
| 41 | Interaction Threshold | None |
" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "exp_clf101 = setup(data = data, target = 'default', session_id=123) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Once the setup is successfully executed, it prints an information grid that contains several important pieces of information. Much of the information is related to the pre-processing pipeline, which is constructed when `setup()` is executed. Most of these features are out of scope for this tutorial. However, a few important things to note at this stage are:\n", - "\n", - "- **session_id :** A pseudo-random number distributed as a seed in all functions for later reproducibility. If no `session_id` is passed, a random number is automatically generated and distributed to all functions. In this experiment `session_id` is set to `123` for later reproducibility.
\n", - "
\n", - "- **Target Type :** Binary or Multiclass. Target type is automatically detected and shown. There is no difference in how the experiment is performed for Binary or Multiclass problems. All functionalities are identical.
\n", - "
\n", - "- **Label Encoded :** When the target variable is of type string (e.g. 'Yes' or 'No') instead of 1 or 0, it is automatically encoded into 1 and 0 and the mapping (0 : No, 1 : Yes) is displayed for reference. In this experiment no label encoding is required as the target variable is of type numeric.
\n", - "
\n", - "- **Original Data :** Displays the original shape of dataset. In this experiment (22800, 24) means 22,800 samples and 24 features including target column.
\n", - "
\n", - "- **Missing Values :** When there are missing values in original data it will show as True. For this experiment there are no missing values in the dataset.
\n", - "
\n", - "- **Numeric Features :** Number of features inferred as numeric. In this dataset, 14 out of 24 features are inferred as numeric.
\n", - "
\n", - "- **Categorical Features :** Number of features inferred as categorical. In this dataset, 9 out of 24 features are inferred as categorical.
\n", - "
\n", - "- **Transformed Train Set :** Displays the shape of the transformed training set. Notice that the original shape of (22800, 24) is transformed into (15959, 90) for the transformed train set. The samples are split between the train and test/hold-out sets (see below), and the number of features has increased from 24 to 90 due to categorical encoding.
\n", - "
\n", - "- **Transformed Test Set :** Displays the shape of transformed test/hold-out set. There are 6,841 samples in test/hold-out set. This split is based on default value of 70/30 that can be changed using `train_size` parameter in setup.
\n", - "\n", - "Notice how a few tasks that are imperative for modeling are handled automatically, such as missing value imputation (in this case there are no missing values in the training data, but we still need imputers for unseen data), categorical encoding etc. Most of the parameters in `setup()` are optional and used for customizing the pre-processing pipeline. These parameters are out of scope for this tutorial, but we will cover them in much more detail as you progress to the intermediate and expert levels." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 7.0 Comparing All Models" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Comparing all models to evaluate performance is the recommended starting point for modeling once setup is completed (unless you know exactly what kind of model you need, which is often not the case). This function trains all models in the model library and scores them using stratified cross validation for metric evaluation. The output prints a score grid that shows average Accuracy, AUC, Recall, Precision, F1 and Kappa across the folds (10 by default) of all the available models in the model library." 
- ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "scrolled": false - }, - "outputs": [ - { - "data": { - "text/html": [ - "
| # | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Ridge Classifier | 0.8236 | 0 | 0.3646 | 0.6932 | 0.4776 | 0.3836 |
| 1 | Linear Discriminant Analysis | 0.8236 | 0.7703 | 0.3813 | 0.6818 | 0.4888 | 0.3923 |
| 2 | Gradient Boosting Classifier | 0.8225 | 0.7887 | 0.3649 | 0.687 | 0.4763 | 0.3813 |
| 3 | CatBoost Classifier | 0.822 | 0.7867 | 0.3873 | 0.6691 | 0.4904 | 0.3917 |
| 4 | Extreme Gradient Boosting | 0.8218 | 0.7894 | 0.3595 | 0.6862 | 0.4715 | 0.3767 |
| 5 | Light Gradient Boosting Machine | 0.8214 | 0.7859 | 0.3878 | 0.6663 | 0.49 | 0.3908 |
| 6 | Ada Boost Classifier | 0.8185 | 0.7783 | 0.3507 | 0.6729 | 0.4607 | 0.3644 |
| 7 | Extra Trees Classifier | 0.8093 | 0.7533 | 0.3839 | 0.61 | 0.4711 | 0.362 |
| 8 | Random Forest Classifier | 0.8084 | 0.738 | 0.3337 | 0.6254 | 0.4349 | 0.3323 |
| 9 | Quadratic Discriminant Analysis | 0.7893 | 0.7392 | 0.1734 | 0.6276 | 0.2378 | 0.1698 |
| 10 | Logistic Regression | 0.7786 | 0.6508 | 0.0006 | 0.075 | 0.0011 | 0.0001 |
| 11 | K Neighbors Classifier | 0.7505 | 0.6099 | 0.1802 | 0.3693 | 0.2421 | 0.1134 |
| 12 | Decision Tree Classifier | 0.7294 | 0.6197 | 0.4221 | 0.3953 | 0.4081 | 0.233 |
| 13 | SVM - Linear Kernel | 0.653 | 0 | 0.2547 | 0.1056 | 0.112 | 0.0121 |
| 14 | Naive Bayes | 0.3651 | 0.6457 | 0.902 | 0.2455 | 0.3859 | 0.0585 |
" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "compare_models()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Two simple words of code ***(not even a line)*** have trained 15 models using 10 fold stratified cross validation and evaluated the 6 most commonly used classification metrics (Accuracy, AUC, Recall, Precision, F1, Kappa). The score grid printed above highlights the highest performing metric for comparison purposes only. The grid is sorted by 'Accuracy' (highest to lowest) by default, which can be changed by passing the `sort` parameter. For example, `compare_models(sort = 'AUC')` will sort the grid by AUC instead of accuracy. If you want to change the number of folds from the default value of `10`, use the `fold` parameter. For example, `compare_models(fold = 5)` compares all models using 5 fold cross validation. Reducing the number of folds will reduce the training time." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 8.0 Create a Model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "While `compare_models()` is a great function and often a starting point in any experiment, it doesn't return any trained model. PyCaret's recommended experiment workflow is to use `compare_models()` right after setup to evaluate the top performing models and shortlist a few candidate models to continue the experiment. As such, the function that actually allows you to create a model is unimaginatively called `create_model()`. This function creates a model and scores it using stratified cross validation. Similar to `compare_models()`, the output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold. 
\n", - "\n", - "For the remaining part of this tutorial, we will work with the models below as our candidate models. They are selected for illustration purposes only, which doesn't necessarily mean they are the top performing or ideal models for this type of data.\n", - "\n", - "- Decision Tree Classifier ('dt')\n", - "- K Neighbors Classifier ('knn')\n", - "- Random Forest Classifier ('rf')\n", - "\n", - "There are 18 classifiers available in the model library of PyCaret. Please see the docstring of `create_model()` for the list of all available models." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 8.1 Decision Tree Classifier" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.7325 0.6213 0.4221 0.4005 0.4110 0.2381\n", - "1 0.7237 0.6199 0.4306 0.3878 0.4081 0.2285\n", - "2 0.7400 0.6361 0.4504 0.4184 0.4338 0.2654\n", - "3 0.7155 0.5984 0.3853 0.3646 0.3747 0.1907\n", - "4 0.7375 0.6175 0.4023 0.4057 0.4040 0.2356\n", - "5 0.7356 0.6376 0.4618 0.4127 0.4358 0.2639\n", - "6 0.7237 0.6210 0.4363 0.3889 0.4112 0.2315\n", - "7 0.7387 0.6365 0.4533 0.4167 0.4342 0.2647\n", - "8 0.7262 0.6062 0.3909 0.3833 0.3871 0.2108\n", - "9 0.7210 0.6030 0.3881 0.3743 0.3811 0.2011\n", - "Mean 0.7294 0.6197 0.4221 0.3953 0.4081 0.2330\n", - "SD 0.0081 0.0134 0.0274 0.0175 0.0211 0.0252" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "dt = create_model('dt')" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", - " max_depth=None, max_features=None, max_leaf_nodes=None,\n", - " min_impurity_decrease=0.0, min_impurity_split=None,\n", - " min_samples_leaf=1, min_samples_split=2,\n", - " min_weight_fraction_leaf=0.0, presort='deprecated',\n", - " random_state=123, splitter='best')\n" - ] - } - ], - "source": [ - "#trained model object is stored in the variable 'dt'. \n", - "print(dt)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 8.2 K Neighbors Classifier" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.7406 0.5879 0.1671 0.3296 0.2218 0.0857\n", - "1 0.7350 0.5787 0.1473 0.2989 0.1973 0.0601\n", - "2 0.7632 0.6648 0.2096 0.4277 0.2814 0.1590\n", - "3 0.7462 0.5984 0.1530 0.3375 0.2105 0.0842\n", - "4 0.7550 0.6098 0.2040 0.3956 0.2692 0.1397\n", - "5 0.7613 0.6203 0.1870 0.4125 0.2573 0.1384\n", - "6 0.7412 0.5883 0.1671 0.3315 0.2222 0.0868\n", - "7 0.7594 0.6139 0.1898 0.4061 0.2587 0.1371\n", - "8 0.7487 0.6100 0.1898 0.3681 0.2505 0.1177\n", - "9 0.7542 0.6264 0.1870 0.3860 0.2519 0.1256\n", - "Mean 0.7505 0.6099 0.1802 0.3693 0.2421 0.1134\n", - "SD 0.0091 0.0234 0.0197 0.0407 0.0260 0.0305" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "knn = create_model('knn')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 8.3 Random Forest Classifier" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8064 0.7435 0.3229 0.6196 0.4246 0.3218\n", - "1 0.8076 0.7488 0.3314 0.6223 0.4325 0.3295\n", - "2 0.8170 0.7576 0.3683 0.6533 0.4710 0.3707\n", - "3 0.8139 0.7175 0.3513 0.6458 0.4550 0.3544\n", - "4 0.8026 0.7199 0.2890 0.6145 0.3931 0.2930\n", - "5 0.8114 0.7446 0.3569 0.6300 0.4557 0.3520\n", - "6 0.8008 0.7286 0.3258 0.5897 0.4197 0.3113\n", - "7 0.8070 0.7502 0.3144 0.6271 0.4189 0.3181\n", - "8 0.8114 0.7262 0.3371 0.6398 0.4416 0.3410\n", - "9 0.8063 0.7428 0.3399 0.6122 0.4372 0.3315\n", - "Mean 0.8084 0.7380 0.3337 0.6254 0.4349 0.3323\n", - "SD 0.0048 0.0131 0.0216 0.0174 0.0213 0.0217" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "rf = create_model('rf')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that the mean score of each model matches the score printed in `compare_models()`. This is because the metrics in the `compare_models()` score grid are averages across the CV folds. As with `compare_models()`, you can change the number of folds from the default of 10 using the `fold` parameter. For example, `create_model('dt', fold = 5)` creates a Decision Tree Classifier using 5-fold stratified CV." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 9.0 Tune a Model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When a model is created using the `create_model()` function, it uses the model's default hyperparameters. To tune the hyperparameters of a model, use the `tune_model()` function. It automatically tunes the hyperparameters over a pre-defined search space and scores the result using stratified cross validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold.
\n", - "
\n", - "**Note:** `tune_model()` does not take a trained model object as input; instead, it requires the model name to be passed as an abbreviated string, in the same way it is passed to `create_model()`. All other functions in `pycaret.classification` that operate on a model (for example `plot_model()` and `predict_model()`) require a trained model object as an argument." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 9.1 Decision Tree Classifier" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8264 0.7219 0.3201 0.7533 0.4493 0.3656\n", - "1 0.8239 0.7259 0.3059 0.7500 0.4346 0.3515\n", - "2 0.8158 0.7548 0.2946 0.6980 0.4143 0.3258\n", - "3 0.8208 0.7171 0.3173 0.7134 0.4392 0.3508\n", - "4 0.8133 0.7178 0.3173 0.6627 0.4291 0.3337\n", - "5 0.8239 0.7359 0.3003 0.7571 0.4300 0.3481\n", - "6 0.8164 0.7197 0.2975 0.7000 0.4175 0.3290\n", - "7 0.8264 0.7408 0.3031 0.7754 0.4358 0.3557\n", - "8 0.8214 0.7158 0.3003 0.7361 0.4266 0.3423\n", - "9 0.8107 0.7322 0.3003 0.6584 0.4125 0.3179\n", - "Mean 0.8199 0.7282 0.3057 0.7204 0.4289 0.3420\n", - "SD 0.0053 0.0120 0.0087 0.0384 0.0111 0.0143" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "tuned_dt = tune_model('dt')" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", - " max_depth=3, max_features=78, max_leaf_nodes=None,\n", - " min_impurity_decrease=0.0, min_impurity_split=None,\n", - " min_samples_leaf=3, min_samples_split=2,\n", - " min_weight_fraction_leaf=0.0, presort='deprecated',\n", - " random_state=123, splitter='best')\n" - ] - } - ], - "source": [ - "#tuned model object is stored in the variable 'tuned_dt'. \n", - "print(tuned_dt)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 9.2 K Neighbors Classifier" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.7738 0.6494 0.0708 0.4310 0.1217 0.0632\n", - "1 0.7769 0.6662 0.0652 0.4694 0.1144 0.0640\n", - "2 0.7838 0.7061 0.1161 0.5541 0.1920 0.1250\n", - "3 0.7769 0.6407 0.0623 0.4681 0.1100 0.0612\n", - "4 0.7769 0.6507 0.0708 0.4717 0.1232 0.0694\n", - "5 0.7788 0.6858 0.0878 0.5000 0.1494 0.0892\n", - "6 0.7807 0.6516 0.0765 0.5294 0.1337 0.0824\n", - "7 0.7845 0.6591 0.0935 0.5789 0.1610 0.1060\n", - "8 0.7769 0.6565 0.0652 0.4694 0.1144 0.0640\n", - "9 0.7812 0.6807 0.0935 0.5323 0.1590 0.0995\n", - "Mean 0.7791 0.6647 0.0802 0.5004 0.1379 0.0824\n", - "SD 0.0032 0.0192 0.0163 0.0442 0.0253 0.0209" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "tuned_knn = tune_model('knn')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 9.3 Random Forest Classifier" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " Accuracy AUC Recall Prec. F1 Kappa\n", - "0 0.8289 0.7871 0.3484 0.7410 0.4740 0.3873\n", - "1 0.8227 0.7990 0.3513 0.6966 0.4670 0.3743\n", - "2 0.8221 0.8251 0.3824 0.6716 0.4874 0.3894\n", - "3 0.8227 0.7740 0.3626 0.6882 0.4750 0.3804\n", - "4 0.8227 0.7795 0.3513 0.6966 0.4670 0.3743\n", - "5 0.8239 0.7918 0.3598 0.6978 0.4748 0.3817\n", - "6 0.8202 0.7536 0.3343 0.6941 0.4512 0.3591\n", - "7 0.8302 0.7935 0.3569 0.7412 0.4818 0.3948\n", - "8 0.8214 0.7649 0.3541 0.6868 0.4673 0.3729\n", - "9 0.8138 0.7824 0.3598 0.6414 0.4610 0.3590\n", - "Mean 0.8229 0.7851 0.3561 0.6955 0.4707 0.3773\n", - "SD 0.0043 0.0186 0.0116 0.0279 0.0098 0.0113" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "tuned_rf = tune_model('rf')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `tune_model()` function performs a random grid search of hyperparameters over a pre-defined search space. By default, it optimizes `Accuracy`, but this can be changed using the `optimize` parameter. For example, `tune_model('dt', optimize = 'AUC')` will search for the hyperparameters of a Decision Tree Classifier that result in the highest `AUC`. In this example, we have used the default metric `Accuracy` only for the sake of simplicity. Generally, when the dataset is imbalanced (such as the credit dataset we are working with), `Accuracy` is not a good metric to consider. How to choose the right metric to evaluate a classifier is beyond the scope of this tutorial. 
However, if you would like to learn more about it, you can __[click here](https://medium.com/@george.drakos62/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-3-classification-3eac420ec991)__ to read an article on how to choose the right evaluation metric.\n", - "\n", - "Notice how the mean Accuracy has improved after tuning:\n", - "\n", - "- Decision Tree Classifier (Before: **`0.7294`** , After: **`0.8199`**)\n", - "- K Neighbors Classifier (Before: **`0.7505`** , After: **`0.7791`**)\n", - "- Random Forest Classifier (Before: **`0.8084`** , After: **`0.8229`**)\n", - "\n", - "However, metrics alone are not the only criteria you should consider when selecting the best model for production. Other factors include training time, the standard deviation across k-folds, etc. These factors are discussed in detail in the Intermediate and Expert level tutorials. For now, let's move forward with the Random Forest Classifier as our best model and complete the rest of the tutorial using the tuned Random Forest Classifier." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 10.0 Plot a Model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Before finalizing the model, the `plot_model()` function can be used to analyze the performance of the model across different aspects such as AUC, confusion matrix, decision boundary etc. This function takes a trained model object and returns a plot based on the test / hold-out set. \n", - "\n", - "There are 15 different plots available; please see the docstring of `plot_model()` for the list of available plots." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 10.1 AUC Plot" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "plot_model(tuned_rf, plot = 'auc')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 10.2 Precision-Recall Curve" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "plot_model(tuned_rf, plot = 'pr')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 10.3 Feature Importance Plot" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "plot_model(tuned_rf, plot='feature')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 10.4 Confusion Matrix" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "plot_model(tuned_rf, plot = 'confusion_matrix')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Another way to analyze the performance of models is to use the `evaluate_model()` function, which displays a user interface for all of the available plots for a given model. It internally uses the `plot_model()` function. " - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "19a7e9d1631c465f9fec92694230ab23", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Hyperparameters', 'param…" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "evaluate_model(tuned_rf)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 11.0 Predict on test / hold-out Sample" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Before finalizing the model, it is advisable to perform one final check by predicting the test/hold-out set and reviewing the evaluation metrics. If you look at the information grid in Section 6 above, you will see that 30% (6,841 samples) of the data has been separated out as the test/hold-out sample. All the evaluation metrics we have seen above are cross-validated results based on the training set (70%) only. Now, using our final trained model stored in the `tuned_rf` variable, we will predict the hold-out sample and evaluate the metrics to see whether they are materially different from the CV results." - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " Model Accuracy AUC Recall Prec. F1 Kappa\n", - "0 Random Forest Classifier 0.8126 0.7538 0.3212 0.6559 0.4312 0.3345" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "predict_model(tuned_rf);" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The accuracy on the test/hold-out set is **`0.8126`** compared to **`0.8229`** achieved in the `tuned_rf` CV results (section 9.3 above). This is not a significant difference. If the difference between the test/hold-out and CV results is large, this would normally indicate over-fitting, but it could also be due to several other factors and would require investigation. In this case, we will move forward with finalizing this model and predicting on unseen data (the 5% that we separated at the beginning and that was never exposed to PyCaret).\n", - "\n", - "(TIP: It's always good to look at the standard deviation of CV results when using `create_model()`.)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 12.0 Finalize Model for Deployment" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finalizing the model is the last step in the experiment. A normal machine learning workflow in PyCaret starts with `setup()`, followed by comparing all models using `compare_models()` and shortlisting a few candidate models (based on the metric of interest) on which to apply modeling techniques such as hyperparameter tuning, ensembling, stacking etc. This workflow eventually leads you to the best model for making predictions on new and unseen data. The `finalize_model()` function fits the model onto the complete dataset, including the test/hold-out sample (30% in this case). The purpose of this function is to train the model on the complete dataset before it is deployed in production." 
- ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [], - "source": [ - "final_rf = finalize_model(tuned_rf)" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", - " criterion='gini', max_depth=10, max_features='auto',\n", - " max_leaf_nodes=None, max_samples=None,\n", - " min_impurity_decrease=0.0, min_impurity_split=None,\n", - " min_samples_leaf=2, min_samples_split=10,\n", - " min_weight_fraction_leaf=0.0, n_estimators=70,\n", - " n_jobs=None, oob_score=False, random_state=123,\n", - " verbose=0, warm_start=False)\n" - ] - } - ], - "source": [ - "#Final Random Forest model parameters for deployment\n", - "print(final_rf)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Caution:** One final word of caution. Once the model is finalized using `finalize_model()`, the entire dataset, including the test/hold-out set, is used for training. As such, if the model is used for predictions on the hold-out set after `finalize_model()` is used, the information grid printed will be misleading, as you are trying to predict on the same data that was used for modeling. To demonstrate this point only, we will use `final_rf` under `predict_model()` to compare the information grid with the one above in section 11. " - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " Model Accuracy AUC Recall Prec. F1 Kappa\n", - "0 Random Forest Classifier 0.8361 0.8189 0.3681 0.7715 0.4984 0.4148" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "predict_model(final_rf);" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice how the AUC for `final_rf` has increased to **`0.8189`** from **`0.7538`**, even though the model is the same. This is because the `final_rf` variable has been trained on the complete dataset, including the test/hold-out set." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 13.0 Predict on unseen data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `predict_model()` function is also used to predict on the unseen dataset. The only difference from section 11 above is that this time we will pass `data_unseen` to the `data` parameter of `predict_model()`. `data_unseen` is the variable created at the beginning of the tutorial; it contains 5% (1200 samples) of the original dataset, which was never exposed to PyCaret (see section 5 for an explanation)." - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4 \\\n", - "0 50000 2 2 1 48 0 0 0 0 \n", - "1 200000 2 1 1 40 2 2 2 2 \n", - "2 50000 2 3 1 44 1 2 3 2 \n", - "3 60000 2 2 1 31 2 2 -1 0 \n", - "4 120000 2 3 2 32 -1 0 0 0 \n", - "\n", - " PAY_5 ... BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 \\\n", - "0 0 ... 8011.0 2028.0 2453.0 2329.0 431.0 300.0 \n", - "1 2 ... 89112.0 4200.0 4100.0 3000.0 3400.0 3500.0 \n", - "2 4 ... 15798.0 2100.0 1000.0 2300.0 0.0 0.0 \n", - "3 0 ... 30384.0 1132.0 60994.0 1436.0 1047.0 1056.0 \n", - "4 0 ... 81354.0 2429.0 3120.0 3300.0 10000.0 3200.0 \n", - "\n", - " PAY_AMT6 default Label Score \n", - "0 500.0 0 0 0.1498 \n", - "1 0.0 1 1 0.7986 \n", - "2 0.0 1 1 0.6261 \n", - "3 1053.0 1 1 0.5063 \n", - "4 3200.0 0 0 0.1479 \n", - "\n", - "[5 rows x 26 columns]" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "unseen_predictions = predict_model(final_rf, data=data_unseen)\n", - "unseen_predictions.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `Label` and `Score` columns are added to `data_unseen`. `Label` is the prediction and `Score` is the probability of the prediction. Notice that the predicted results are concatenated to the original dataset, while all the transformations are automatically performed in the background." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 14.0 Saving the model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We have now finished the experiment by finalizing the `tuned_rf` model, which is now stored in the `final_rf` variable. We have also used the model stored in `final_rf` to predict `data_unseen`. This brings us to the end of our experiment, but one question remains: what happens when you have more new data to predict? Do you have to go through the entire experiment again? 
The answer is no: you don't need to rerun the entire experiment and reconstruct the pipeline to generate predictions on new data. PyCaret's built-in `save_model()` function allows you to save the model along with the entire transformation pipeline for later use." - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Transformation Pipeline and Model Succesfully Saved\n" - ] - } - ], - "source": [ - "save_model(final_rf,'Final RF Model 08Feb2020')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "(TIP: It's always good to include the date in the filename when saving models; it helps with version control.)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 15.0 Loading the saved model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To load a saved model at a future date in the same or a different environment, we use PyCaret's `load_model()` function and then apply the saved model to new unseen data for prediction." - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Transformation Pipeline and Model Sucessfully Loaded\n" - ] - } - ], - "source": [ - "saved_final_rf = load_model('Final RF Model 08Feb2020')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Once the model is loaded in the environment, you can simply use it to predict on any new data using the same `predict_model()` function. Below we have applied the loaded model to predict the same `data_unseen` that we used in section 13 above." 
- ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [], - "source": [ - "new_prediction = predict_model(saved_final_rf, data=data_unseen)" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4 \\\n", - "0 50000 2 2 1 48 0 0 0 0 \n", - "1 200000 2 1 1 40 2 2 2 2 \n", - "2 50000 2 3 1 44 1 2 3 2 \n", - "3 60000 2 2 1 31 2 2 -1 0 \n", - "4 120000 2 3 2 32 -1 0 0 0 \n", - "\n", - " PAY_5 ... BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 \\\n", - "0 0 ... 8011.0 2028.0 2453.0 2329.0 431.0 300.0 \n", - "1 2 ... 89112.0 4200.0 4100.0 3000.0 3400.0 3500.0 \n", - "2 4 ... 15798.0 2100.0 1000.0 2300.0 0.0 0.0 \n", - "3 0 ... 30384.0 1132.0 60994.0 1436.0 1047.0 1056.0 \n", - "4 0 ... 81354.0 2429.0 3120.0 3300.0 10000.0 3200.0 \n", - "\n", - " PAY_AMT6 default Label Score \n", - "0 500.0 0 0 0.1498 \n", - "1 0.0 1 1 0.7986 \n", - "2 0.0 1 1 0.6261 \n", - "3 1053.0 1 1 0.5063 \n", - "4 3200.0 0 0 0.1479 \n", - "\n", - "[5 rows x 26 columns]" - ] - }, - "execution_count": 28, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "new_prediction.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that results of `unseen_predictions` and `new_prediction` are identical." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 16.0 Wrap-up / Next Steps?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "What we have covered in this tutorial is the entire machine learning pipeline from data ingestion, pre-processing, training the model, hyperparameter tuning, prediction and saving the model for later use. We have completed all this in less than 10 commands which are naturally constructed and very intuitive to remember such as `create_model()`, `tune_model()`, `compare_models()`. Re-creating the entire experiment without PyCaret would have taken well over 100 lines of code in most of the libraries.\n", - "\n", - "In this tutorial, we have only covered basics of `pycaret.classification`. 
In the following tutorials, we will go deeper into advanced pre-processing techniques that allow you to fully customize your machine learning pipeline, as well as ensembling, generalized stacking and other advanced techniques that every data scientist should know. \n", - "\n", - "See you at the next tutorial. Follow the link to __[Binary Classification Tutorial (CLF102) - Intermediate Level](BinaryClassificationTutorial(CLF102)_LevelIntermediate.ipynb)__" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -}
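
As a quick sanity check on the score grids shown in this tutorial, the `Mean` and `SD` rows can be reproduced from the per-fold values. The sketch below is plain Python rather than PyCaret code. It uses the ten fold accuracies printed for `tuned_rf` in section 9.3, and it assumes that the `SD` row is a population standard deviation (dividing by the number of folds); that assumption is consistent with the printed value but is not stated anywhere in the tutorial. The names `fold_accuracy`, `mean_acc` and `sd_acc` are illustrative only.

```python
from statistics import mean, pstdev

# Per-fold Accuracy values copied from the tuned Random Forest score grid (section 9.3).
fold_accuracy = [0.8289, 0.8227, 0.8221, 0.8227, 0.8227,
                 0.8239, 0.8202, 0.8302, 0.8214, 0.8138]

# Average across the 10 CV folds, rounded like the "Mean" row of the grid.
mean_acc = round(mean(fold_accuracy), 4)

# Assumption: the "SD" row is a population standard deviation (divide by n, not n - 1).
sd_acc = round(pstdev(fold_accuracy), 4)

print(f"Mean: {mean_acc:.4f}  SD: {sd_acc:.4f}")
```

Under that assumption the printed values agree with the `Mean` (0.8229) and `SD` (0.0043) rows of the grid. A small SD relative to the mean suggests the model behaves consistently across folds, which is the point of the tip in section 11.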