Welcome to pycaret’s documentation!¶
PyCaret is an open source, low-code machine learning library in Python that aims to reduce the hypothesis-to-insights cycle time in an ML experiment. It enables data scientists to perform end-to-end experiments quickly and efficiently. In comparison with other open source machine learning libraries, PyCaret is an alternative low-code library that can be used to perform complex machine learning tasks with only a few lines of code. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, Microsoft LightGBM, spaCy and many more.
The design and simplicity of PyCaret is inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen data scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more expertise. Seasoned data scientists are often difficult to find and expensive to hire, but citizen data scientists can be an effective way to mitigate this gap and address data-related challenges in a business setting.
PyCaret is simple, easy to use and deployment ready. All the steps performed in a ML experiment can be reproduced using a pipeline that is automatically developed and orchestrated in PyCaret as you progress through the experiment. A pipeline can be saved in a binary file format that is transferable across environments.
For more information on PyCaret, please visit our official website https://www.pycaret.org
Regression¶
pycaret.regression.automl(optimize='r2')¶
Space reserved for docstring.
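Because the docstring above is a placeholder, the exact behavior is not documented here. A minimal usage sketch based only on the signature (assuming automl() returns the best model of the current experiment according to the optimize metric, which is how later PyCaret releases describe it):
# Hedged sketch; return value and behavior are assumptions, not documented above
best_model = automl(optimize = 'r2')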
pycaret.regression.blend_models(estimator_list='All', fold=10, round=4, turbo=True, improve_only=True, optimize='r2', verbose=True)¶
This function creates an ensemble meta-estimator that fits a base regressor on the whole dataset. It then averages the predictions to form a final prediction. By default, this function uses all estimators in the model library (excluding a few estimators when turbo is True) or the specific trained estimators passed as a list in the estimator_list param. The ensemble is scored using Kfold Cross Validation. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default = 10 folds).
This function returns a trained model object.
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
blend_all = blend_models()
This will result in VotingRegressor for all models in the library except ‘ard’, ‘kr’ and ‘mlp’.
For specific models, you can use:
lr = create_model('lr')
rf = create_model('rf')
knn = create_model('knn')
blend_three = blend_models(estimator_list = [lr,rf,knn])
This will create a VotingRegressor of lr, rf and knn.
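For illustration, a hedged sketch combining the documented improve_only and optimize params (continuing the example above; parameter values are arbitrary):
# Keep the blend only if it improves MAE over the base estimators
blend_custom = blend_models(estimator_list = [lr, rf, knn], improve_only = True, optimize = 'mae')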
- Parameters
estimator_list (string ('All') or list of objects, default = 'All') –
fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
turbo (Boolean, default = True) – When turbo is set to True, estimators that use a Radial Kernel are blacklisted.
improve_only (Boolean, default = True) – When set to True, the base estimator is returned when the metric doesn't improve by blend_models. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
optimize (string, default = 'r2') – Only used when improve_only is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter are 'mae', 'mse', 'rmse', 'r2', 'rmsle' and 'mape'.
verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns
--------
score grid – Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.
model – Trained Voting Regressor model object.
Warnings
---------
None
pycaret.regression.compare_models(blacklist=None, fold=10, round=4, sort='R2', n_select=1, turbo=True, verbose=True)¶
This function uses all models in the model library and scores them using Kfold Cross Validation. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default CV = 10 folds) for all the available models in the model library.
When turbo is set to True, ‘kr’, ‘ard’ and ‘mlp’ are excluded due to their longer training times. By default, the turbo param is set to True.
List of models in Model Library
Estimator                      Abbreviated String    Original Implementation
---------                      ------------------    -----------------------
Linear Regression              'lr'                  linear_model.LinearRegression
Lasso Regression               'lasso'               linear_model.Lasso
Ridge Regression               'ridge'               linear_model.Ridge
Elastic Net                    'en'                  linear_model.ElasticNet
Least Angle Regression         'lar'                 linear_model.Lars
Lasso Least Angle Regression   'llar'                linear_model.LassoLars
Orthogonal Matching Pursuit    'omp'                 linear_model.OMP
Bayesian Ridge                 'br'                  linear_model.BayesianRidge
Automatic Relevance Determ.    'ard'                 linear_model.ARDRegression
Passive Aggressive Regressor   'par'                 linear_model.PAR
Random Sample Consensus        'ransac'              linear_model.RANSACRegressor
TheilSen Regressor             'tr'                  linear_model.TheilSenRegressor
Huber Regressor                'huber'               linear_model.HuberRegressor
Kernel Ridge                   'kr'                  kernel_ridge.KernelRidge
Support Vector Machine         'svm'                 svm.SVR
K Neighbors Regressor          'knn'                 neighbors.KNeighborsRegressor
Decision Tree                  'dt'                  tree.DecisionTreeRegressor
Random Forest                  'rf'                  ensemble.RandomForestRegressor
Extra Trees Regressor          'et'                  ensemble.ExtraTreesRegressor
AdaBoost Regressor             'ada'                 ensemble.AdaBoostRegressor
Gradient Boosting              'gbr'                 ensemble.GradientBoostingRegressor
Multi Level Perceptron         'mlp'                 neural_network.MLPRegressor
Extreme Gradient Boosting      'xgboost'             xgboost.readthedocs.io
Light Gradient Boosting        'lightgbm'            github.com/microsoft/LightGBM
CatBoost Regressor             'catboost'            https://catboost.ai
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
compare_models()
This will return the averaged score grid of all models except ‘kr’, ‘ard’ and ‘mlp’. When turbo param is set to False, all models including ‘kr’, ‘ard’ and ‘mlp’ are used, but this may result in longer training times.
compare_models(blacklist = ['knn','gbr'], turbo = False)
This will return a comparison of all models except K Nearest Neighbour and Gradient Boosting Regressor.
compare_models(blacklist = ['knn','gbr'], turbo = True)
This will return a comparison of all models except K Nearest Neighbour, Gradient Boosting Regressor, Kernel Ridge Regressor, Automatic Relevance Determinant and Multi Level Perceptron.
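A hedged sketch using the documented sort and n_select params to keep the top models for further work (continuing the setup above; the variable name top3 is illustrative):
# Return the top 3 models sorted by MAE instead of the default sort metric
top3 = compare_models(sort = 'MAE', n_select = 3)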
- Parameters
blacklist (string, default = None) – In order to omit certain models from the comparison, the abbreviation string (see above list) can be passed as a list in the blacklist param. This is normally done to be more efficient with time.
fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
sort (string, default = 'R2') – The scoring measure specified is used for sorting the average score grid. Other options are 'MAE', 'MSE', 'RMSE', 'RMSLE' and 'MAPE'.
n_select (int, default = 1) – Number of top_n models to return. Use a negative argument for bottom selection. For example, n_select = -3 means bottom 3 models.
turbo (Boolean, default = True) – When turbo is set to True, it blacklists estimators that have longer training times.
verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns
--------
score grid – Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.
Warnings
---------
- compare_models(), though attractive, might be time consuming with large datasets. By default turbo is set to True, which blacklists models that have longer training times. Changing the turbo parameter to False may result in very high training times with datasets where the number of samples exceeds 10,000.
pycaret.regression.create_model(estimator=None, ensemble=False, method=None, fold=10, round=4, verbose=True, **kwargs)¶
This function creates a model and scores it using Kfold Cross Validation (default = 10 folds). The output prints a score grid that shows MAE, MSE, RMSE, RMSLE, R2 and MAPE.
This function returns a trained model object.
setup() function must be called before using create_model()
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
lr = create_model('lr')
This will create a trained Linear Regression model.
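A hedged sketch of the documented ensemble and **kwargs options (continuing the example above; the n_estimators keyword is assumed to be forwarded to scikit-learn's RandomForestRegressor):
# Bagged Decision Tree using the ensemble / method params
bagged_dt = create_model('dt', ensemble = True, method = 'Bagging')
# Pass an estimator-specific keyword argument through **kwargs (assumed forwarding)
rf_500 = create_model('rf', n_estimators = 500)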
- Parameters
estimator (string, default = None) – Enter the abbreviated string of the estimator class. List of estimators supported:

Estimator                      Abbreviated String    Original Implementation
---------                      ------------------    -----------------------
Linear Regression              'lr'                  linear_model.LinearRegression
Lasso Regression               'lasso'               linear_model.Lasso
Ridge Regression               'ridge'               linear_model.Ridge
Elastic Net                    'en'                  linear_model.ElasticNet
Least Angle Regression         'lar'                 linear_model.Lars
Lasso Least Angle Regression   'llar'                linear_model.LassoLars
Orthogonal Matching Pursuit    'omp'                 linear_model.OMP
Bayesian Ridge                 'br'                  linear_model.BayesianRidge
Automatic Relevance Determ.    'ard'                 linear_model.ARDRegression
Passive Aggressive Regressor   'par'                 linear_model.PAR
Random Sample Consensus        'ransac'              linear_model.RANSACRegressor
TheilSen Regressor             'tr'                  linear_model.TheilSenRegressor
Huber Regressor                'huber'               linear_model.HuberRegressor
Kernel Ridge                   'kr'                  kernel_ridge.KernelRidge
Support Vector Machine         'svm'                 svm.SVR
K Neighbors Regressor          'knn'                 neighbors.KNeighborsRegressor
Decision Tree                  'dt'                  tree.DecisionTreeRegressor
Random Forest                  'rf'                  ensemble.RandomForestRegressor
Extra Trees Regressor          'et'                  ensemble.ExtraTreesRegressor
AdaBoost Regressor             'ada'                 ensemble.AdaBoostRegressor
Gradient Boosting              'gbr'                 ensemble.GradientBoostingRegressor
Multi Level Perceptron         'mlp'                 neural_network.MLPRegressor
Extreme Gradient Boosting      'xgboost'             xgboost.readthedocs.io
Light Gradient Boosting        'lightgbm'            github.com/microsoft/LightGBM
CatBoost Regressor             'catboost'            https://catboost.ai

ensemble (Boolean, default = False) – True would result in an ensemble of the estimator using the method parameter defined.
method (String, 'Bagging' or 'Boosting', default = None) – method must be defined when ensemble is set to True. Default method is set to None.
fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
**kwargs – Additional keyword arguments to pass to the estimator.
Returns
--------
score grid – Scoring metrics used are MAE, MSE, RMSE, RMSLE, R2 and MAPE. Mean and standard deviation of the scores across the folds are also returned.
model – Trained model object.
Warnings
---------
None
pycaret.regression.create_stacknet(estimator_list, meta_model=None, fold=10, round=4, restack=True, improve_only=True, optimize='r2', finalize=False, verbose=True)¶
This function creates a sequential stack net using cross validated predictions at each layer. The final score grid contains predictions from the meta model using Kfold Cross Validation. Base level models can be passed in the estimator_list param; the layers can be organized as sub lists within the estimator_list object. The restack param controls the ability to expose raw features to the meta model.
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
dt = create_model('dt')
rf = create_model('rf')
ada = create_model('ada')
ridge = create_model('ridge')
knn = create_model('knn')
stacknet = create_stacknet(estimator_list = [[dt,rf],[ada,ridge,knn]])
This will result in the stacking of models in multiple layers. The first layer contains dt and rf, the predictions of which are used by models in the second layer to generate predictions which are then used by the meta model to generate final predictions. By default, the meta model is Linear Regression but can be changed with meta_model param.
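A hedged variation of the example above showing the documented meta_model and restack params (Bayesian Ridge as the meta model is an illustrative choice):
# Use Bayesian Ridge as the meta model and hide raw features from it
br = create_model('br')
stacknet_custom = create_stacknet(estimator_list = [[dt,rf],[ada,ridge,knn]], meta_model = br, restack = False)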
- Parameters
estimator_list (nested list of objects) –
meta_model (object, default = None) – If set to None, Linear Regression is used as the meta model.
fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
restack (Boolean, default = True) – When restack is set to True, raw data and predictions of all layers will be exposed to the meta model when making predictions. When set to False, only the predicted label of the last layer is passed to the meta model when making final predictions.
improve_only (Boolean, default = True) – When set to True, the base estimator is returned when the metric doesn't improve by create_stacknet. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
optimize (string, default = 'r2') – Only used when improve_only is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter are 'mae', 'mse', 'rmse', 'r2', 'rmsle' and 'mape'.
finalize (Boolean, default = False) – When finalize is set to True, it will fit the stacker on the entire dataset including the hold-out sample created during the setup() stage. It is not recommended to set this to True here; if you would like to fit the stacker on the entire dataset including the hold-out, use finalize_model().
verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns
--------
score grid – Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.
container – List of all models where the last element is the meta model.
Warnings
---------
None
pycaret.regression.deploy_model(model, model_name, authentication, platform='aws')¶ (In Preview)
This function deploys the transformation pipeline and trained model object for production use. The platform of deployment can be defined under the platform param along with the applicable authentication tokens which are passed as a dictionary to the authentication param.
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
lr = create_model('lr')
deploy_model(model = lr, model_name = 'deploy_lr', platform = 'aws', authentication = {'bucket' : 'pycaret-test'})
This will deploy the model on an AWS S3 account under the bucket 'pycaret-test'.
Before deploying a model to AWS S3 (‘aws’), environment variables must be configured using the command line interface. To configure AWS environment variables, type aws configure in the command line. The following information is required, which can be generated using the Identity and Access Management (IAM) portal of your amazon console account:
AWS Access Key ID
AWS Secret Access Key
Default Region Name (can be seen under Global settings on your AWS console)
Default output format (must be left blank)
- Parameters
model (object) – A trained model object should be passed as an estimator.
model_name (string) – Name of the model to be passed as a string.
authentication (dict) – Dictionary of applicable authentication tokens. When platform = 'aws': {'bucket' : 'Name of Bucket on S3'}
platform (string, default = 'aws') – Name of the platform for deployment. Currently available options are: 'aws'.
Returns
--------
Success Message
Warnings
---------
- This function uses file storage services to deploy the model on a cloud platform. As such, this is efficient for batch use. Where the production objective is to obtain predictions at an instance level, this may not be the efficient choice as it transmits the binary pickle file between your local python environment and the platform.
pycaret.regression.ensemble_model(estimator, method='Bagging', fold=10, n_estimators=10, round=4, improve_only=True, optimize='r2', verbose=True)¶
This function ensembles the trained base estimator using the method defined in the 'method' param (default = 'Bagging'). The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default CV = 10 folds).
This function returns a trained model object.
Model must be created using create_model() or tune_model().
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
dt = create_model('dt')
ensembled_dt = ensemble_model(dt)
This will return an ensembled Decision Tree model using ‘Bagging’.
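A hedged sketch of the documented 'Boosting' method and n_estimators param (continuing the example above):
# Boosted Decision Tree with 50 base estimators
boosted_dt = ensemble_model(dt, method = 'Boosting', n_estimators = 50)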
- Parameters
estimator (object, default = None) –
method (String, default = 'Bagging') – The Bagging method will create an ensemble meta-estimator that fits base regressors each on random subsets of the original dataset. The other available method is 'Boosting', which fits a regressor on the original dataset and then fits additional copies of the regressor on the same dataset but where the weights of instances are adjusted according to the error of the current prediction. As such, subsequent regressors focus more on difficult cases.
fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
n_estimators (integer, default = 10) – The number of base estimators in the ensemble. In case of perfect fit, the learning procedure is stopped early.
round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
improve_only (Boolean, default = True) – When set to True, the base estimator is returned when the metric doesn't improve by ensemble_model. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
optimize (string, default = 'r2') – Only used when improve_only is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter are 'mae', 'mse', 'rmse', 'r2', 'rmsle' and 'mape'.
verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns
--------
score grid – A table containing the scores of the model across the kfolds. Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.
model – Trained model object.
Warnings
---------
None
pycaret.regression.evaluate_model(estimator)¶
This function displays a user interface for all of the available plots for a given estimator. It internally uses the plot_model() function.
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
lr = create_model('lr')
evaluate_model(lr)
This will display the User Interface for all of the plots for a given estimator.
- Parameters
estimator (object, default = none) – A trained model object should be passed as an estimator.
Returns
--------
User Interface
Warnings
---------
None
pycaret.regression.finalize_model(estimator)¶
This function fits the estimator onto the complete dataset passed during the setup() stage. The purpose of this function is to prepare for final model deployment after experimentation.
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
lr = create_model('lr')
final_lr = finalize_model(lr)
This will return the final model object fitted to the complete dataset.
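A hedged follow-on sketch showing a typical finalize-then-save flow using the other functions documented in this module (the file name is illustrative):
# Fit on the complete dataset, then persist the pipeline and model for later use
final_lr = finalize_model(lr)
save_model(final_lr, 'final_lr_23122019')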
- Parameters
estimator (object, default = none) – A trained model object should be passed as an estimator.
Returns
--------
Model – Trained model object fitted on the complete dataset.
Warnings
---------
- If the model returned by finalize_model() is used on predict_model() without passing a new unseen dataset, then the information grid printed is misleading as the model is trained on the complete dataset including the test / hold-out sample. Once finalize_model() is used, the model is considered ready for deployment and should be used on new unseen datasets only.
pycaret.regression.interpret_model(estimator, plot='summary', feature=None, observation=None)¶
This function takes a trained model object and returns an interpretation plot based on the test / hold-out set. It only supports tree based algorithms.
This function is implemented based on SHAP (SHapley Additive exPlanations), a unified approach to explaining the output of any machine learning model. SHAP connects game theory with local explanations.
For more information : https://shap.readthedocs.io/en/latest/
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
dt = create_model('dt')
interpret_model(dt)
This will return a summary interpretation plot of the Decision Tree model.
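Hedged sketches of the documented 'correlation' and 'reason' plots (continuing the example above; the feature name and observation index are illustrative):
# SHAP correlation plot for a single feature of the boston dataset
interpret_model(dt, plot = 'correlation', feature = 'crim')
# Reason plot for the first observation in the test / hold-out set
interpret_model(dt, plot = 'reason', observation = 0)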
- Parameters
estimator (object, default = none) – A trained tree based model object should be passed as an estimator.
plot (string, default = 'summary') – Other available options are 'correlation' and 'reason'.
feature (string, default = None) – This parameter is only needed when plot = 'correlation'. By default feature is set to None, which means the first column of the dataset will be used as a variable. A feature parameter must be passed to change this.
observation (integer, default = None) – This parameter only comes into effect when plot is set to 'reason'. If no observation number is provided, it will return an analysis of all observations with the option to select the feature on x and y axes through drop down interactivity. For analysis at the sample level, an observation parameter must be passed with the index value of the observation in the test / hold-out set.
Returns
--------
Visual Plot – Returns the visual plot. Returns the interactive JS plot when plot = 'reason'.
Warnings
---------
None
pycaret.regression.load_experiment(experiment_name)¶
This function loads a previously saved experiment from the current active directory into the current python environment. Load object must be a pickle file.
saved_experiment = load_experiment('experiment_23122019')
This will load the entire experiment pipeline into the object saved_experiment. The experiment file must be in the current directory.
- Parameters
experiment_name (string, default = none) – Name of the pickle file to be passed as a string.
Returns
--------
Information Grid – Grid containing details of saved objects in the experiment pipeline.
Warnings
---------
None
pycaret.regression.load_model(model_name, platform=None, authentication=None, verbose=True)¶
This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.
saved_lr = load_model('lr_model_23122019')
This will load the previously saved model into the saved_lr variable. The file must be in the current directory.
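A hedged sketch of loading a model back from cloud storage using the documented platform and authentication params (the model and bucket names are illustrative):
# Load a model previously deployed to AWS S3
saved_lr_aws = load_model('deploy_lr', platform = 'aws', authentication = {'bucket' : 'pycaret-test'})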
- Parameters
model_name (string, default = none) – Name of the pickle file to be passed as a string.
platform (string, default = None) – Name of the platform, if loading the model from cloud. Currently available options are: 'aws'.
authentication (dict) – Dictionary of applicable authentication tokens. When platform = 'aws': {'bucket' : 'Name of Bucket on S3'}
verbose (Boolean, default = True) – Success message is not printed when verbose is set to False.
Returns
--------
Success Message
Warnings
---------
None
pycaret.regression.plot_model(estimator, plot='residuals')¶
This function takes a trained model object and returns a plot based on the test / hold-out set. The process may require the model to be re-trained in certain cases. See the list of plots supported below.
Model must be created using create_model() or tune_model().
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
lr = create_model('lr')
plot_model(lr)
This will return a residuals plot of a trained Linear Regression model.
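Hedged sketches of other documented plot types (continuing the example above):
# Prediction error plot
plot_model(lr, plot = 'error')
# Feature importance plot
plot_model(lr, plot = 'feature')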
- Parameters
estimator (object, default = none) – A trained model object should be passed as an estimator.
plot (string, default = 'residuals') – Enter the abbreviation for the type of plot. The current list of plots supported:

Name                         Abbreviated String    Original Implementation
---------                    ------------------    -----------------------
Residuals Plot               'residuals'           .. / residuals.html
Prediction Error Plot        'error'               .. / peplot.html
Cooks Distance Plot          'cooks'               .. / influence.html
Recursive Feat. Selection    'rfe'                 .. / rfecv.html
Learning Curve               'learning'            .. / learning_curve.html
Validation Curve             'vc'                  .. / validation_curve.html
Manifold Learning            'manifold'            .. / manifold.html
Feature Importance           'feature'             N/A
Model Hyperparameter         'parameter'           N/A
** https
Returns
--------
Visual Plot
Warnings
---------
None
pycaret.regression.predict_model(estimator, data=None, platform=None, authentication=None, round=4)¶
This function is used to predict new data using a trained estimator. It accepts an estimator created using one of the functions in pycaret that returns a trained model object, or a list of trained model objects created using stack_models() or create_stacknet(). New unseen data can be passed to the data param as a pandas DataFrame. If data is not passed, the test / hold-out set separated at the time of setup() is used to generate predictions.
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
lr = create_model('lr')
lr_predictions_holdout = predict_model(lr)
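A hedged sketch of scoring genuinely unseen data with the documented data param; here new_data is built from the same boston frame purely for illustration (any DataFrame with the training feature columns would do):
# For illustration only: reuse the boston frame without the target as "unseen" data
new_data = boston.drop('medv', axis = 1)
predictions = predict_model(lr, data = new_data)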
- Parameters
estimator (object or list of objects / string, default = None) – When the estimator is passed as a string, load_model() is called internally to load the pickle file from the active directory, or from the cloud platform when the platform param is passed.
data ({array-like, sparse matrix}, shape (n_samples, n_features)) – n_samples is the number of samples and n_features is the number of features. All features used during training must be present in the new dataset.
platform (string, default = None) – Name of the platform, if loading the model from cloud. Currently available options are: 'aws'.
authentication (dict) – Dictionary of applicable authentication tokens. When platform = 'aws': {'bucket' : 'Name of Bucket on S3'}
round (integer, default = 4) – Number of decimal places the predicted labels will be rounded to.
Returns
--------
info grid
Warnings
---------
- If the estimator passed is created using finalize_model(), then the metrics printed in the information grid may be misleading as the model is trained on the complete dataset including the test / hold-out set. Once finalize_model() is used, the model is considered ready for deployment and should be used on new unseen datasets only.
pycaret.regression.save_experiment(experiment_name=None)¶
This function saves the entire experiment into the current active directory. All outputs created using pycaret are internally saved into a binary list which is pickled when save_experiment() is used.
save_experiment()
This will save the entire experiment into the current active directory. By default, the name of the experiment will use the session_id generated during setup(). To use a custom name, a string must be passed to the experiment_name param. For example:
save_experiment('experiment_23122019')
- Parameters
experiment_name (string, default = none) – Name of the pickle file to be passed as a string.
Returns
--------
Success Message
Warnings
---------
None
pycaret.regression.save_model(model, model_name, verbose=True)¶
This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
lr = create_model('lr')
save_model(lr, 'lr_model_23122019')
This will save the transformation pipeline and model as a binary pickle file in the current directory.
- Parameters
model (object, default = none) – A trained model object should be passed as an estimator.
model_name (string, default = none) – Name of the pickle file to be passed as a string.
verbose (Boolean, default = True) – Success message is not printed when verbose is set to False.
Returns
--------
Success Message
Warnings
---------
None
pycaret.regression.setup(data, target, train_size=0.7, sampling=True, sample_estimator=None, categorical_features=None, categorical_imputation='constant', ordinal_features=None, high_cardinality_features=None, high_cardinality_method='frequency', numeric_features=None, numeric_imputation='mean', date_features=None, ignore_features=None, normalize=False, normalize_method='zscore', transformation=False, transformation_method='yeo-johnson', handle_unknown_categorical=True, unknown_categorical_method='least_frequent', pca=False, pca_method='linear', pca_components=None, ignore_low_variance=False, combine_rare_levels=False, rare_level_threshold=0.1, bin_numeric_features=None, remove_outliers=False, outliers_threshold=0.05, remove_multicollinearity=False, multicollinearity_threshold=0.9, create_clusters=False, cluster_iter=20, polynomial_features=False, polynomial_degree=2, trigonometry_features=False, polynomial_threshold=0.1, group_features=None, group_names=None, feature_selection=False, feature_selection_threshold=0.8, feature_interaction=False, feature_ratio=False, interaction_threshold=0.01, transform_target=False, transform_target_method='box-cox', data_split_shuffle=True, folds_shuffle=False, n_jobs=-1, html=True, session_id=None, silent=False, verbose=True, profile=False)¶
This function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes two mandatory parameters: a dataframe {array-like, sparse matrix} and the name of the target column.
All other parameters are optional.
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
‘boston’ is a pandas DataFrame and ‘medv’ is the name of the target column.
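A hedged sketch enabling a few of the optional preprocessing params documented below (values are illustrative; see the parameter list for details):
# Setup with normalization, power transformation and a fixed seed for reproducibility
experiment_name = setup(data = boston, target = 'medv',
                        normalize = True, transformation = True, session_id = 123)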
- Parameters
data ({array-like, sparse matrix}, shape (n_samples, n_features)) – n_samples is the number of samples and n_features is the number of features.
target (string) – Name of the target column to be passed in as a string.
train_size (float, default = 0.7) – Size of the training set. By default, 70% of the data will be used for training and validation. The remaining data will be used for the test / hold-out set.
sampling (bool, default = True) – When the sample size exceeds 25,000 samples, pycaret will build a base estimator at various sample sizes from the original dataset. This will return a performance plot of R2 values at various sample levels, that will assist in deciding the preferred sample size for modeling. The desired sample size must then be entered for training and validation in the pycaret environment. When the sample_size entered is less than 1, the remaining dataset (1 - sample) is used for fitting the model only when finalize_model() is called.
sample_estimator (object, default = None) – If None, Linear Regression is used by default.
categorical_features (string, default = None) – If the inferred data types are not correct, categorical_features can be used to overwrite the inferred type. If when running setup the type of 'column1' is inferred as numeric instead of categorical, then this parameter can be used to overwrite the type by passing categorical_features = ['column1'].
categorical_imputation (string, default = 'constant') – If missing values are found in categorical features, they will be imputed with a constant 'not_available' value. The other available option is 'mode', which imputes the missing value using the most frequent value in the training dataset.
ordinal_features (dictionary, default = None) – When the data contains ordinal features, they must be encoded differently using the ordinal_features param. If the data has a categorical variable with values of 'low', 'medium', 'high' and it is known that low < medium < high, then it can be passed as ordinal_features = { 'column_name' : ['low', 'medium', 'high'] }. The list sequence must be in increasing order from lowest to highest.
high_cardinality_features (string, default = None) – When the data contains features with high cardinality, they can be compressed into fewer levels by passing them as a list of column names with high cardinality. Features are compressed using the method defined in the high_cardinality_method param.
high_cardinality_method (string, default = 'frequency') – When the method is set to 'frequency', it will replace the original value of the feature with the frequency distribution and convert the feature into numeric. The other available method is 'clustering', which performs clustering on the statistical attributes of the data and replaces the original value of the feature with the cluster label. The number of clusters is determined using a combination of Calinski-Harabasz and Silhouette criterion.
numeric_features (string, default = None) – If the inferred data types are not correct, numeric_features can be used to overwrite the inferred type. If when running setup the type of 'column1' is inferred as categorical instead of numeric, then this parameter can be used to overwrite the type by passing numeric_features = ['column1'].
numeric_imputation (string, default = 'mean') – If missing values are found in numeric features, they will be imputed with the mean value of the feature. The other available option is 'median', which imputes the value using the median value in the training dataset.
date_features (string, default = None) – If the data has a DateTime column that is not automatically detected when running setup, this parameter can be used by passing date_features = 'date_column_name'. It can work with multiple date columns. Date columns are not used in modeling. Instead, feature extraction is performed and date columns are dropped from the dataset. If the date column includes a time stamp, features related to time will also be extracted.
ignore_features (string, default = None) – If any feature should be ignored for modeling, it can be passed to the ignore_features param. The ID and DateTime columns, when inferred, are automatically set to ignore for modeling.
normalize (bool, default = False) – When set to True, the feature space is transformed using the normalize_method param. Generally, linear algorithms perform better with normalized data; however, the results may vary and it is advised to run multiple experiments to evaluate the benefit of normalization.
normalize_method (string, default = 'zscore') – Defines the method to be used for normalization. By default, the normalize method is set to 'zscore'. The standard zscore is calculated as z = (x - u) / s. The other available options are:
'minmax' : scales and translates each feature individually such that it is in the range of 0 - 1.
'maxabs' : scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
'robust' : scales and translates each feature according to the Interquartile range. When the dataset contains outliers, the robust scaler often gives better results.
transformation (bool, default = False) – When set to True, a power transformation is applied to make the data more normal / Gaussian-like. This is useful for modeling issues related to heteroscedasticity or other situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
transformation_method (string, default = 'yeo-johnson') – Defines the method for transformation. By default, the transformation method is set to 'yeo-johnson'. The other available option is 'quantile' transformation. Both transformations map the feature set to follow a Gaussian-like or normal distribution. Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.
handle_unknown_categorical (bool, default = True) – When set to True, unknown categorical levels in new / unseen data are replaced by the most or least frequent level as learned in the training data. The method is defined under the unknown_categorical_method param.
unknown_categorical_method (string, default = 'least_frequent') – Method used to replace unknown categorical levels in unseen data. Method can be set to 'least_frequent' or 'most_frequent'.
pca (bool, default = False) – When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in the pca_method param. In supervised learning, pca is generally performed when dealing with a high feature space and memory is a constraint. Note that not all datasets can be decomposed efficiently using a linear PCA technique and that applying PCA may result in loss of information. As such, it is advised to run multiple experiments with different pca_methods to evaluate the impact.
pca_method (string, default = 'linear') – The 'linear' method performs Linear dimensionality reduction using Singular Value Decomposition. The other available options are:
'kernel' : dimensionality reduction through the use of the RBF kernel.
'incremental' : replacement for 'linear' pca when the dataset to be decomposed is too large to fit in memory.
pca_components (int/float, default = 0.99) – Number of components to keep. If pca_components is a float, it is treated as a target percentage for information retention. When pca_components is an integer, it is treated as the number of features to be kept. pca_components must be strictly less than the original number of features in the dataset.
ignore_low_variance (bool, default = False) – When set to True, all categorical features with statistically insignificant variances are removed from the dataset. The variance is calculated using the ratio of unique values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.
combine_rare_levels (bool, default = False) – When set to True, all levels in categorical features below the threshold defined in the rare_level_threshold param are combined together as a single level. There must be at least two levels under the threshold for this to take effect. rare_level_threshold represents the percentile distribution of level frequency. Generally, this technique is applied to limit a sparse matrix caused by high numbers of levels in categorical features.
rare_level_threshold (float, default = 0.1) – Percentile distribution below which rare categories are combined. Only comes into effect when combine_rare_levels is set to True.
bin_numeric_features (list, default = None) – When a list of numeric features is passed, they are transformed into categorical features using KMeans, where values in each bin have the same nearest center of a 1D k-means cluster. The number of clusters is determined based on the 'sturges' method. It is only optimal for gaussian data and underestimates the number of bins for large non-gaussian datasets.
remove_outliers (bool, default = False) – When set to True, outliers from the training data are removed using PCA linear dimensionality reduction using the Singular Value Decomposition technique.
outliers_threshold (float, default = 0.05) – The percentage / proportion of outliers in the dataset can be defined using the outliers_threshold param. By default, 0.05 is used, which means 0.025 of the values on each side of the distribution's tail are dropped from the training data.
remove_multicollinearity (bool, default = False) – When set to True, the variables with inter-correlations higher than the threshold defined under the multicollinearity_threshold param are dropped. When two features are highly correlated with each other, the feature that is less correlated with the target variable is dropped.
multicollinearity_threshold (float, default = 0.9) – Threshold used for dropping the correlated features. Only comes into effect when remove_multicollinearity is set to True.
create_clusters (bool, default = False) – When set to True, an additional feature is created where each instance is assigned to a cluster. The number of clusters is determined using a combination of Calinski-Harabasz and Silhouette criterion.
cluster_iter (int, default = 20) – Number of iterations used to create a cluster. Each iteration represents cluster size. Only comes into effect when the create_clusters param is set to True.
polynomial_features (bool, default = False) – When set to True, new features are created based on all polynomial combinations that exist within the numeric features in a dataset, to the degree defined in the polynomial_degree param.
polynomial_degree (int, default = 2) – Degree of polynomial features. For example, if an input sample is two dimensional and of the form [a, b], the polynomial features with degree = 2 are: [1, a, b, a^2, ab, b^2].
trigonometry_features (bool, default = False) – When set to True, new features are created based on all trigonometric combinations that exist within the numeric features in a dataset, to the degree defined in the polynomial_degree param.
polynomial_threshold (float, default = 0.1) – This is used to compress a sparse matrix of polynomial and trigonometric features. Polynomial and trigonometric features whose feature importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.
group_features (list or list of list, default = None) – When a dataset contains features that have related characteristics, the group_features param can be used for statistical feature extraction. For example, if a dataset has numeric features that are related with each other (i.e. 'Col1', 'Col2', 'Col3'), a list containing the column names can be passed under group_features to extract statistical information such as the mean, median, mode and standard deviation.
group_names (list, default = None) – When group_features is passed, a name of the group can be passed into the group_names param as a list containing strings. The length of the group_names list must equal the length of group_features. When the length doesn't match or the name is not passed, new features are sequentially named such as group_1, group_2, etc.
feature_selection (bool, default = False) – When set to True, a subset of features is selected using a combination of various permutation importance techniques including Random Forest, AdaBoost and Linear correlation with the target variable. The size of the subset is dependent on the feature_selection_threshold param. Generally, this is used to constrain the feature space in order to improve efficiency in modeling. When polynomial_features and feature_interaction are used, it is highly recommended to define the feature_selection_threshold param with a lower value.
feature_selection_threshold (float, default = 0.8) – Threshold used for feature selection (including newly created polynomial features). A higher value will result in a higher feature space. It is recommended to do multiple trials with different values of feature_selection_threshold, especially in cases where polynomial_features and feature_interaction are used. Setting a very low value may be efficient but could result in under-fitting.
feature_interaction (bool, default = False) – When set to True, it will create new features by interacting (a * b) all numeric variables in the dataset, including polynomial and trigonometric features (if created). This feature is not scalable and may not work as expected on datasets with a large feature space.
feature_ratio (bool, default = False) – When set to True, it will create new features by calculating the ratios (a / b) of all numeric variables in the dataset. This feature is not scalable and may not work as expected on datasets with a large feature space.
interaction_threshold (float, default = 0.01) – Similar to polynomial_threshold, it is used to compress a sparse matrix of newly created features through interaction. Features whose importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.
transform_target (bool, default = False) – When set to True, the target variable is transformed using the method defined in the transform_target_method param. Target transformation is applied separately from feature transformations.
transform_target_method (string, default = 'box-cox') – 'Box-cox' and 'yeo-johnson' methods are supported. Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data. When transform_target_method is 'box-cox' and the target variable contains negative values, the method is internally forced to 'yeo-johnson' to avoid exceptions.
data_split_shuffle (bool, default = True) – If set to False, prevents shuffling of rows when splitting data.
folds_shuffle (bool, default = False) – If set to False, prevents shuffling of rows when using cross validation.
n_jobs (int, default = -1) – The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor, set n_jobs to None.
html (bool, default = True) – If set to False, prevents the runtime display of the monitor. This must be set to False when using an environment that doesn't support HTML.
session_id (int, default = None) – If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
silent (bool, default = False) – When set to True, confirmation of data types is not required. All preprocessing will be performed assuming automatically inferred data types. Not recommended for direct use except for established pipelines.
verbose (Boolean, default = True) – Information grid is not printed when verbose is set to False.
profile (bool, default = False) – If set to True, a data profile for Exploratory Data Analysis will be displayed in an interactive HTML report.
Returns
--------
info grid
environment – This function returns various outputs that are stored in variables as a tuple. They are used by other functions in pycaret.
Warnings
---------
None
pycaret.regression.stack_models(estimator_list, meta_model=None, fold=10, round=4, restack=True, plot=False, improve_only=True, optimize='r2', finalize=False, verbose=True)¶
This function creates a meta model and scores it using Kfold Cross Validation. The predictions from the base level models passed in the estimator_list param are used as input features for the meta model. The restack parameter controls the ability to expose raw features to the meta model when set to True.
The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default = 10 Folds).
This function returns a container which is the list of all models in stacking.
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
dt = create_model('dt')
rf = create_model('rf')
ada = create_model('ada')
ridge = create_model('ridge')
knn = create_model('knn')
stacked_models = stack_models(estimator_list=[dt,rf,ada,ridge,knn])
This will create a meta model that will use the predictions of all the models provided in estimator_list param. By default, the meta model is Linear Regression but can be changed with meta_model param.
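A hedged variation of the example above using the documented meta_model, restack and plot params (LightGBM as the meta model is an illustrative choice):
# Use LightGBM as the meta model, pass only base-model predictions, and plot their correlation
lightgbm = create_model('lightgbm')
stacked_custom = stack_models(estimator_list = [dt,rf,ada,knn], meta_model = lightgbm, restack = False, plot = True)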
- Parameters
estimator_list (list of objects) –
meta_model (object, default = None) – If set to None, Linear Regression is used as the meta model.
fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
restack (Boolean, default = True) – When restack is set to True, raw data will be exposed to the meta model when making predictions; when set to False, only the predicted label is passed to the meta model when making final predictions.
plot (Boolean, default = False) – When plot is set to True, it will return the correlation plot of predictions from all base models provided in estimator_list.
improve_only (Boolean, default = True) – When set to True, the base estimator is returned when the metric doesn't improve by stack_models. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
optimize (string, default = 'r2') – Only used when improve_only is set to True. The optimize parameter is used to compare the ensembled model with the base estimator. Values accepted in the optimize parameter are 'mae', 'mse', 'rmse', 'r2', 'rmsle' and 'mape'.
finalize (Boolean, default = False) – When finalize is set to True, it will fit the stacker on the entire dataset including the hold-out sample created during the setup() stage. It is not recommended to set this to True here; if you would like to fit the stacker on the entire dataset including the hold-out, use finalize_model().
verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
Returns
--------
score grid – Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.
container – List of all the models where the last element is the meta model.
Warnings
---------
None
pycaret.regression.tune_model(estimator, fold=10, round=4, n_iter=10, custom_grid=None, optimize='r2', verbose=True, improve_only=True)¶
This function tunes the hyperparameters of a model and scores it using Kfold Cross Validation. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default = 10 folds).
This function returns a trained model object.
tune_model() only accepts a string parameter for estimator.
from pycaret.datasets import get_data
boston = get_data('boston')
experiment_name = setup(data = boston, target = 'medv')
tuned_xgboost = tune_model('xgboost')
This will tune the hyperparameters of Extreme Gradient Boosting Regressor.
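A hedged sketch of the documented n_iter and optimize params; custom_grid appears in the signature above and is assumed (not documented below) to accept a dictionary mapping hyperparameter names to candidate values:
# More random-search iterations, optimizing MAE instead of R2
tuned_dt = tune_model('dt', n_iter = 50, optimize = 'mae')
# Assumed usage of custom_grid (hypothetical grid for a Decision Tree)
tuned_dt_grid = tune_model('dt', custom_grid = {'max_depth': [2, 4, 6, 8]})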
- Parameters
estimator (string, default = None) – Enter the abbreviated name of the estimator class. List of estimators supported:

Estimator                      Abbreviated String    Original Implementation
---------                      ------------------    -----------------------
Linear Regression              'lr'                  linear_model.LinearRegression
Lasso Regression               'lasso'               linear_model.Lasso
Ridge Regression               'ridge'               linear_model.Ridge
Elastic Net                    'en'                  linear_model.ElasticNet
Least Angle Regression         'lar'                 linear_model.Lars
Lasso Least Angle Regression   'llar'                linear_model.LassoLars
Orthogonal Matching Pursuit    'omp'                 linear_model.OMP
Bayesian Ridge                 'br'                  linear_model.BayesianRidge
Automatic Relevance Determ.    'ard'                 linear_model.ARDRegression
Passive Aggressive Regressor   'par'                 linear_model.PAR
Random Sample Consensus        'ransac'              linear_model.RANSACRegressor
TheilSen Regressor             'tr'                  linear_model.TheilSenRegressor
Huber Regressor                'huber'               linear_model.HuberRegressor
Kernel Ridge                   'kr'                  kernel_ridge.KernelRidge
Support Vector Machine         'svm'                 svm.SVR
K Neighbors Regressor          'knn'                 neighbors.KNeighborsRegressor
Decision Tree                  'dt'                  tree.DecisionTreeRegressor
Random Forest                  'rf'                  ensemble.RandomForestRegressor
Extra Trees Regressor          'et'                  ensemble.ExtraTreesRegressor
AdaBoost Regressor             'ada'                 ensemble.AdaBoostRegressor
Gradient Boosting              'gbr'                 ensemble.GradientBoostingRegressor
Multi Level Perceptron         'mlp'                 neural_network.MLPRegressor
Extreme Gradient Boosting      'xgboost'             xgboost.readthedocs.io
Light Gradient Boosting        'lightgbm'            github.com/microsoft/LightGBM
CatBoost Regressor             'catboost'            https://catboost.ai

fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
round (integer, default = 4) – Number of decimal places the metrics in the score grid will be rounded to.
n_iter (integer, default = 10) – Number of iterations within the Random Grid Search. For every iteration, the model randomly selects one value from the pre-defined grid of hyperparameters.
optimize (string, default = 'r2') – Measure used to select the best model through hyperparameter tuning. The default scoring measure is 'r2'. Other measures include 'mae', 'mse', 'rmse', 'rmsle' and 'mape'. When using 'rmse' or 'rmsle' the base scorer is 'mse', and when using 'mape' the base scorer is 'mae'.
verbose (Boolean, default = True) – Score grid is not printed when verbose is set to False.
improve_only (Boolean, default = True) – When set to True, the base estimator is returned when the metric doesn't improve by tune_model. This guarantees the returned object performs at least as well as the base estimator created using create_model or the model returned by compare_models.
Returns
--------
score grid – Scoring metrics used are MAE, MSE, RMSE, R2, RMSLE and MAPE. Mean and standard deviation of the scores across the folds are also returned.
model – Trained model object.
Warnings
---------
- The estimator parameter takes an abbreviated string. Passing a trained model object returns an error. The tune_model() function internally calls create_model() before tuning the hyperparameters.
NLP¶
pycaret.nlp.assign_model(model, verbose=True)¶
This function assigns each of the data points in the dataset passed during the setup stage to one of the topics using the trained model object passed in the model param. create_model() function must be called before using assign_model().
This function returns a dataframe with topic weights, dominant topic and % of the dominant topic (where applicable).
from pycaret.datasets import get_data
kiva = get_data('kiva')
experiment_name = setup(data = kiva, target = 'en')
lda = create_model('lda')
lda_df = assign_model(lda)
This will return a dataframe with inferred topics using the trained model.
- Parameters
model (trained model object, default = None) –
verbose (Boolean, default = True) – Status update is not printed when verbose is set to False.
Returns
--------
dataframe – Returns a dataframe with inferred topics using the trained model object.
Warnings
---------
None
pycaret.nlp.create_model(model=None, multi_core=False, num_topics=None, verbose=True)¶
This function creates a model on the dataset passed as a data param during the setup stage. setup() function must be called before using create_model().
This function returns a trained model object.
from pycaret.datasets import get_data
kiva = get_data('kiva')
experiment_name = setup(data = kiva, target = 'en')
lda = create_model('lda')
This will return a trained Latent Dirichlet Allocation model.
- Parameters
model (trained model object) –
Enter abbreviated string of the model class. List of models supported:

Model                                Abbreviated String   Original Implementation
---------                            ------------------   -----------------------
Latent Dirichlet Allocation          'lda'                gensim/models/ldamodel.html
Latent Semantic Indexing             'lsi'                gensim/models/lsimodel.html
Hierarchical Dirichlet Process       'hdp'                gensim/models/hdpmodel.html
Random Projections                   'rp'                 gensim/models/rpmodel.html
Non-Negative Matrix Factorization    'nmf'                sklearn.decomposition.NMF.html

multi_core (Boolean, default = False) –
True would utilize all CPU cores to parallelize and speed up model training. Only available for 'lda'. For all other models, the multi_core parameter is ignored.
num_topics (integer, default = 4) –
Number of topics to be created. If None, default is set to 4.
verbose (Boolean, default = True) –
Status update is not printed when verbose is set to False.
Returns –
-------- –
model –
------ –
Warnings –
--------- –
None –
-
pycaret.nlp.
evaluate_model
(model)¶ This function displays the user interface for all the available plots for a given model. It internally uses the plot_model() function.
from pycaret.datasets import get_data
kiva = get_data('kiva')
experiment_name = setup(data = kiva, target = 'en')
lda = create_model('lda')
evaluate_model(lda)
This will display the User Interface for all of the plots for the given model.
- Parameters
model (object, default = none) –
A trained model object should be passed.
Returns –
-------- –
User Interface –
-------------- –
Warnings –
--------- –
None –
-
pycaret.nlp.
get_topics
(data, text, model=None, num_topics=4)¶ Magic function to get topic model in Power Query / Power BI.
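A minimal sketch of how this might be called from a Power Query / Power BI Python script, using the signature shown above; the dataframe name 'dataset' (as provided by Power BI) and the text column 'en' are illustrative assumptions:

from pycaret.nlp import get_topics
dataset = get_topics(data = dataset, text = 'en', model = 'lda', num_topics = 4)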
-
pycaret.nlp.
load_experiment
(experiment_name)¶ This function loads a previously saved experiment from the current active directory into current python environment. Load object must be a pickle file.
saved_experiment = load_experiment(‘experiment_23122019’)
This will load the entire experiment pipeline into the object saved_experiment. The experiment file must be in current directory.
- Parameters
experiment_name (string, default = none) –
Name of pickle file to be passed as a string.
Returns –
-------- –
Information Grid containing details of saved objects in experiment pipeline. –
Warnings –
--------- –
None –
-
pycaret.nlp.
load_model
(model_name)¶ This function loads a previously saved model from the current active directory into the current python environment. Load object must be a pickle file.
saved_lda = load_model(‘lda_model_23122019’)
This will load the trained model into the saved_lda variable using the model_name param. The file must be in the current directory.
- Parameters
model_name (string, default = none) –
Name of pickle file to be passed as a string.
Returns –
-------- –
Success Message –
Warnings –
--------- –
None –
-
pycaret.nlp.
plot_model
(model=None, plot='frequency', topic_num=None)¶ This function takes a trained model object (optional) and returns a plot based on the inferred dataset by internally calling assign_model before generating a plot. Where a model parameter is not passed, a plot on the entire dataset will be returned instead of one at the topic level. As such, plot_model can be used with or without model. All plots with a model parameter passed as a trained model object will return a plot based on the first topic i.e. ‘Topic 0’. This can be changed using the topic_num param.
from pycaret.datasets import get_data
kiva = get_data('kiva')
experiment_name = setup(data = kiva, target = 'en')
lda = create_model('lda')
plot_model(lda, plot = ‘frequency’)
This will return a frequency plot on a trained Latent Dirichlet Allocation model for all documents in ‘Topic 0’. The topic number can be changed as follows:
plot_model(lda, plot = ‘frequency’, topic_num = ‘Topic 1’)
This will now return a frequency plot on a trained LDA model for all documents inferred in ‘Topic 1’.
Alternatively, if the following is used:
plot_model(plot = 'frequency')
This will return a frequency plot on the entire training corpus compiled during the setup stage.
- Parameters
model (object, default = none) –
A trained model object can be passed. Model must be created using create_model().
plot (string, default = 'frequency') –
Enter abbreviation for type of plot. The current list of plots supported are:

Name                          Abbreviated String
---------                     ------------------
Word Token Frequency          'frequency'
Word Distribution Plot        'distribution'
Bigram Frequency Plot         'bigram'
Trigram Frequency Plot        'trigram'
Sentiment Polarity Plot       'sentiment'
Part of Speech Frequency      'pos'
t-SNE (3d) Dimension Plot     'tsne'
Topic Model (pyLDAvis)        'topic_model'
Topic Infer Distribution      'topic_distribution'
Wordcloud                     'wordcloud'
UMAP Dimensionality Plot      'umap'

topic_num (string, default = None) –
Topic number to be passed as a string. If set to None, default generation will be on 'Topic 0'.
Returns –
-------- –
Visual Plot –
------------ –
Warnings –
--------- –
- 'pos' and 'umap' plots are not available at the model level. Hence the model parameter is ignored. The result will always be based on the entire training corpus.
- The 'topic_model' plot is based on the pyLDAVis implementation. Hence it is not available for model = 'lsi', 'rp' and 'nmf'.
-
pycaret.nlp.
save_experiment
(experiment_name=None)¶ This function saves the entire experiment into the current active directory. All outputs using pycaret are internally saved into a binary list which is pickled when save_experiment() is used.
save_experiment()
This will save the entire experiment into the current active directory. By default, the name of the experiment will use the session_id generated during setup(). To use a custom name, a string must be passed to the experiment_name param. For example:
save_experiment(‘experiment_23122019’)
- Parameters
experiment_name (string, default = none) –
Name of pickle file to be passed as a string.
Returns –
-------- –
Success Message –
Warnings –
--------- –
None –
-
pycaret.nlp.
save_model
(model, model_name)¶ This function saves the trained model object into the current active directory as a pickle file for later use.
from pycaret.datasets import get_data
kiva = get_data('kiva')
experiment_name = setup(data = kiva, target = 'en')
lda = create_model('lda')
save_model(lda, ‘lda_model_23122019’)
This will save the model as a binary pickle file in the current directory.
- Parameters
model (object, default = none) –
A trained model object should be passed.
model_name (string, default = none) –
Name of pickle file to be passed as a string.
Returns –
-------- –
Success Message –
Warnings –
--------- –
None –
-
pycaret.nlp.
setup
(data, target=None, custom_stopwords=None, session_id=None)¶ This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes one mandatory parameter: dataframe {array-like, sparse matrix} or object of type list. If a dataframe is passed, the target column containing text must be specified. When the data passed is of type list, no target parameter is required. All other parameters are optional. This module only supports the English language at this time.
from pycaret.datasets import get_data
kiva = get_data('kiva')
experiment_name = setup(data = kiva, target = 'en')
‘kiva’ is a pandas Dataframe.
- Parameters
data ({array-like, sparse matrix}) –
shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features, or object of type list with n length.
target (string) –
If data is of type DataFrame, the name of the column containing text values must be passed as a string.
custom_stopwords (list, default = None) –
list containing custom stopwords.
session_id (int, default = None) –
If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
Returns –
-------- –
environment – This function returns various outputs that are stored in a variable as a tuple. They are used by other functions in pycaret.
Some functionalities in pycaret.nlp require you to have the English language model. The language model is not downloaded automatically when you install pycaret. You will have to download two models using your Anaconda Prompt or python command line interface. To download the models, please type the following in your command line:
python -m spacy download en_core_web_sm
python -m textblob.download_corpora
Once downloaded, please restart your kernel and re-run the setup.
-
pycaret.nlp.
tune_model
(model=None, multi_core=False, supervised_target=None, estimator=None, optimize=None, auto_fe=True, fold=10)¶ This function tunes the num_topics model parameter using a predefined grid with the objective of optimizing a supervised learning metric as defined in the optimize param. You can choose the supervised estimator from a large library available in pycaret. By default, supervised estimator is Linear.
This function returns the tuned model object.
from pycaret.datasets import get_data
kiva = get_data('kiva')
experiment_name = setup(data = kiva, target = 'en')
tuned_lda = tune_model(model = 'lda', supervised_target = 'status')
This will return a trained Latent Dirichlet Allocation model.
- Parameters
model (trained model object with best K number of topics.) –
Enter abbreviated name of the model. List of available models supported:

Model                                Abbreviated String   Original Implementation
---------                            ------------------   -----------------------
Latent Dirichlet Allocation          'lda'                gensim/models/ldamodel.html
Latent Semantic Indexing             'lsi'                gensim/models/lsimodel.html
Hierarchical Dirichlet Process       'hdp'                gensim/models/hdpmodel.html
Random Projections                   'rp'                 gensim/models/rpmodel.html
Non-Negative Matrix Factorization    'nmf'                sklearn.decomposition.NMF.html
multi_core (Boolean, default = False) –
True would utilize all CPU cores to parallelize and speed up model training. Only available for 'lda'. For all other models, the multi_core parameter is ignored.
supervised_target (string) –
Name of the target column for supervised learning. If None, the model coherence value is used as the objective function.
estimator (string, default = None) –
Estimator                       Abbreviated String    Task
---------                       ------------------    ---------------
Logistic Regression             'lr'                  Classification
K Nearest Neighbour             'knn'                 Classification
Naives Bayes                    'nb'                  Classification
Decision Tree                   'dt'                  Classification
SVM (Linear)                    'svm'                 Classification
SVM (RBF)                       'rbfsvm'              Classification
Gaussian Process                'gpc'                 Classification
Multi Level Perceptron          'mlp'                 Classification
Ridge Classifier                'ridge'               Classification
Random Forest                   'rf'                  Classification
Quadratic Disc. Analysis        'qda'                 Classification
AdaBoost                        'ada'                 Classification
Gradient Boosting               'gbc'                 Classification
Linear Disc. Analysis           'lda'                 Classification
Extra Trees Classifier          'et'                  Classification
Extreme Gradient Boosting       'xgboost'             Classification
Light Gradient Boosting         'lightgbm'            Classification
CatBoost Classifier             'catboost'            Classification
Linear Regression               'lr'                  Regression
Lasso Regression                'lasso'               Regression
Ridge Regression                'ridge'               Regression
Elastic Net                     'en'                  Regression
Least Angle Regression          'lar'                 Regression
Lasso Least Angle Regression    'llar'                Regression
Orthogonal Matching Pursuit     'omp'                 Regression
Bayesian Ridge                  'br'                  Regression
Automatic Relevance Determ.     'ard'                 Regression
Passive Aggressive Regressor    'par'                 Regression
Random Sample Consensus         'ransac'              Regression
TheilSen Regressor              'tr'                  Regression
Huber Regressor                 'huber'               Regression
Kernel Ridge                    'kr'                  Regression
Support Vector Machine          'svm'                 Regression
K Neighbors Regressor           'knn'                 Regression
Decision Tree                   'dt'                  Regression
Random Forest                   'rf'                  Regression
Extra Trees Regressor           'et'                  Regression
AdaBoost Regressor              'ada'                 Regression
Gradient Boosting               'gbr'                 Regression
Multi Level Perceptron          'mlp'                 Regression
Extreme Gradient Boosting       'xgboost'             Regression
Light Gradient Boosting         'lightgbm'            Regression
CatBoost Regressor              'catboost'            Regression

If set to None, a Linear model is used by default for both classification and regression tasks.
optimize (string, default = None) –
For Classification tasks: Accuracy, AUC, Recall, Precision, F1, Kappa
For Regression tasks: MAE, MSE, RMSE, R2, ME
If set to None, default is 'Accuracy' for classification and 'R2' for regression tasks.
auto_fe (boolean, default = True) –
Automatic text feature engineering. Only used when supervised_target is passed. When set to true, it will generate text based features such as polarity, subjectivity, wordcounts to be used in supervised learning. Ignored when supervised_target is set to None.
fold (integer, default = 10) –
Number of folds to be used in Kfold CV. Must be at least 2.
Returns –
-------- –
visual plot – plot of the metric to optimize on the y-axis. Coherence is used when learning is unsupervised. Also, prints the best model metric.
model –
----------- –
Warnings –
--------- –
- Random Projections ('rp') and Non Negative Matrix Factorization ('nmf') are not available for unsupervised learning. An error is raised when 'rp' or 'nmf' is passed without supervised_target.
- Estimators using kernel based methods such as Kernel Ridge Regressor, Automatic Relevance Determinant, Gaussian Process Classifier, Radial Basis Support Vector Machine and Multi Level Perceptron may have longer training times.
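A hedged sketch combining the parameters documented above; it assumes the kiva experiment from the earlier example, where 'status' is a column available as a supervised target, and the estimator and metric choices are illustrative:

tuned_lda = tune_model(model = 'lda', supervised_target = 'status', estimator = 'rf', optimize = 'AUC', auto_fe = True)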
Preprocess¶
-
class
pycaret.preprocess.
Advanced_Feature_Selection_Classic
(target, ml_usecase='classification', top_features_to_pick=0.1, random_state=42, subclass='ignore')¶ Selects important features and reduces the feature space. Feature selection is based on Random Forest, Light GBM and Correlation.
To run on multiclass classification, set the subclass argument to 'multi'.
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Binning
(features_to_discretize)¶ Converts numerical variables to categorical variables through binning
The number of bins is automatically determined through the Sturges method
- Once discretized, the original feature will be dropped
- Args:
features_to_discretize: list of feature names to be binned
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
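A minimal usage sketch of the fit_transform interface shown above; the dataframe and column name are illustrative assumptions:

from pycaret.preprocess import Binning
import pandas as pd

df = pd.DataFrame({'age': [23, 45, 31, 52, 29, 61], 'income': [40, 85, 60, 120, 55, 95]})
binner = Binning(features_to_discretize = ['age'])
df_binned = binner.fit_transform(df)  # 'age' is replaced by its binned (categorical) version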
-
class
pycaret.preprocess.
Catagorical_variables_With_Rare_levels
(target, new_level_name='others_infrequent', threshold=0.05)¶ - Merges levels in categorical features with a more frequent level if they appear less than a threshold count
e.g. Col=[a,a,a,a,b,b,c,c]: if threshold is set to 2, then c will be merged with b because both are below threshold. There have to be at least two levels below threshold for this to work. The process will keep going until all the levels have at least 2 (threshold) counts.
- Only handles categorical features - It is recommended to run the Zroe_NearZero_Variance and Define_dataTypes first - Ignores target variable
- Args:
threshold: int, default 10
target: string, name of the target variable
new_level_name: string, name given to the new level generated, default 'others'
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Clean_Colum_Names
¶ Cleans special chars that are not supported by the JSON format
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Cluster_Entire_Data
(target_variable, check_clusters_upto=20, random_state=42)¶ Applies kmeans clustering to the entire data set and produces clusters
Highly recommended to run the DataTypes_Auto_infer class first Args:
target_variable: target variable (integer or numerical only)
check_clusters_upto: to determine the optimum number of kmeans clusters, set the upper limit of clusters
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
DFS_Classic
(target, ml_usecase='classification', interactions=['multiply', 'divide', 'add', 'subtract'], top_features_to_pick_percentage=0.05, random_state=42, subclass='ignore')¶ Automated feature interactions using multiplication, division, addition & subtraction
Only accepts numeric / One Hot Encoded features
Takes DF, returns same DF
For Multiclass classification problems, set the subclass arg as 'multi'
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
DataTypes_Auto_infer
(target, ml_usecase, categorical_features=[], numerical_features=[], time_features=[], features_todrop=[], display_types=True)¶ This will try to infer data types automatically; the option to override the learnt data types is also available.
This also automatically deletes duplicate columns (values or same column name), removes rows where the target variable is null and removes columns and rows where all the records are null
-
fit
(dataset, y=None)¶ - Parameters
data – accepts a pandas data frame
- Returns
Panda Data Frame
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
transform
(dataset, y=None)¶ - Parameters
data – accepts a pandas data frame
- Returns
Panda Data Frame
-
class
pycaret.preprocess.
Dummify
(target)¶ makes one hot encoded (dummy) variables
it is HIGHLY recommended to run the Select_Data_Type class first
Ignores target variable
- Args:
target: string , name of the target variable
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Empty
¶ Takes DF, returns same DF
-
fit_transform
(data, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Fix_multicollinearity
(threshold, target_variable, correlation_with_target_threshold=0.0, correlation_with_target_preference=1.0)¶ Fixes multicollinearity between predictor variables, also considering the correlation with the target variable. Only applies to regression or two class classification ML use cases. Takes numerical and one hot encoded variables only
- Args:
threshold (float): the utmost absolute pearson correlation tolerated between features, from 0.0 to 1.0
target_variable (str): the target variable/column name
correlation_with_target_threshold: minimum absolute correlation required between every feature and the target variable, default 1.0 (0.0 to 1.0)
correlation_with_target_preference: float (0.0 to 1.0), default .08; while choosing between a pair of features w.r.t. multicollinearity & correlation with target, this gives the option to favour one measure over the other, e.g. if the value is .6, during the feature selection tug of war the correlation-with-target measure will have a higher say. A value of .5 means both measures have equal say.
-
fit
(data, y=None)¶ - Parameters
data – takes a preprocessed data frame
- Returns
None
-
fit_transform
(data, y=None)¶ - Parameters
data – takes a preprocessed data frame
- Returns
data frame
-
transform
(dataset, y=None)¶ - Args:
data = takes preprocessed data frame
- Returns
data frame
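A minimal usage sketch under the Args above; the dataframes, column name and threshold value are illustrative assumptions:

from pycaret.preprocess import Fix_multicollinearity

fixer = Fix_multicollinearity(threshold = 0.9, target_variable = 'target')
train_fixed = fixer.fit_transform(train_df)   # drops one feature out of each highly correlated pair
test_fixed = fixer.transform(test_df)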
-
class
pycaret.preprocess.
Group_Similar_Features
(group_name=[], list_of_grouped_features=[[]])¶ Given a list of features, it creates aggregate features
features created are Min, Max, Mean, Median, Mode & Std
Only works on numerical features Args:
list_of_similar_features: list of lists of strings, e.g. [['col1','col2'],['col3','col4']]
group_name: list, group name/names to be added as prefix to aggregate features, e.g. ['group1','group2']
-
fit_transform
(data, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Make_NonLiner_Features
(target, ml_usecase='classification', Polynomial_degree=2, other_nonliner_features=['sin', 'cos', 'tan'], top_features_to_pick=0.2, random_state=42, subclass='ignore')¶ converts numerical features into polynomial features
it is HIGHLY recommended to run the Autoinfer_Data_Type class first
Ignores target variable
it picks up data type float64 as numerical
For multiclass classification problems, set the subclass arg to 'multi'
- Args:
target: string, name of the target variable
Polynomial_degree: int, default 2
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Make_Time_Features
(time_feature=[], list_of_features=['month', 'weekday', 'is_month_end', 'is_month_start', 'hour'])¶ - Given a time feature, it extracts more features - Only accepts / works where the feature / data type is datetime64[ns] - the full list of features is:
['month','weekday','is_month_end','is_month_start','hour']
all extracted features are defined as string / object
- it is recommended to run Define_dataTypes first
- Args:
time_feature: list of feature names as datetime64[ns], default empty/none; if empty/None, it will try to pick up dates automatically where the data type is datetime64[ns]
list_of_features: list of required features, default value ['month','weekday','is_month_end','is_month_start','hour']
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
New_Catagorical_Levels_in_TestData
(target, replacement_strategy='most frequent')¶ - This treats the case where a new level appears in a categorical feature of the test dataset (i.e. a level on which the model was not trained previously) - It simply replaces the new level in the test data set with the most frequent or least frequent level in the same feature in the training data set - It is recommended to run the Zroe_NearZero_Variance and Define_dataTypes first - Ignores target variable
- Args:
target: string, name of the target variable
replacement_strategy: string, 'least frequent' or 'most frequent' (default 'most frequent')
-
fit_transform
(data, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Ordinal
(info_as_dict)¶ converts categorical features into ordinal values
takes a dataframe, and information about column names and ordered categories as a dict
returns a pandas data frame of floats
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Outlier
(target, contamination=0.2, random_state=42, methods=['knn', 'iso', 'pca'])¶ Removes outliers using ABOD, KNN, IFO, PCA & HOBS with hard voting
Only takes numerical / One Hot Encoded features
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
pycaret.preprocess.
Preprocess_Path_One
(train_data, target_variable, ml_usecase=None, test_data=None, categorical_features=[], numerical_features=[], time_features=[], features_todrop=[], display_types=True, imputation_type='simple imputer', numeric_imputation_strategy='mean', categorical_imputation_strategy='not_available', apply_zero_nearZero_variance=False, club_rare_levels=False, rara_level_threshold_percentage=0.05, apply_untrained_levels_treatment=False, untrained_levels_treatment_method='least frequent', apply_ordinal_encoding=False, ordinal_columns_and_categories={}, apply_cardinality_reduction=False, cardinal_method='cluster', cardinal_features=[], apply_binning=False, features_to_binn=[], apply_grouping=False, group_name=[], features_to_group_ListofList=[[]], apply_polynomial_trigonometry_features=False, max_polynomial=2, trigonometry_calculations=['sin', 'cos', 'tan'], top_poly_trig_features_to_select_percentage=0.2, scale_data=False, scaling_method='zscore', Power_transform_data=False, Power_transform_method='quantile', target_transformation=False, target_transformation_method='bc', remove_outliers=False, outlier_contamination_percentage=0.01, outlier_methods=['pca', 'iso', 'knn'], apply_feature_selection=False, feature_selection_top_features_percentage=0.8, remove_multicollinearity=False, maximum_correlation_between_features=0.9, remove_perfect_collinearity=False, apply_feature_interactions=False, feature_interactions_to_apply=['multiply', 'divide', 'add', 'subtract'], feature_interactions_top_features_to_select_percentage=0.01, cluster_entire_data=False, range_of_clusters_to_try=20, apply_pca=False, pca_method='pca_liner', pca_variance_retained_or_number_of_components=0.99, random_state=42)¶ - Follwoing preprocess steps are taken:
Auto infer data types
Impute (simple or with surrogate columns)
Ordinal Encoder
Drop categorical variables that have zero variance or near zero variance
Club categorical variables levels together as a new level (other_infrequent) that are rare / at the bottom 5% of the variable distribution
Club unseen levels in test dataset with most/least frequent levels in train dataset
Reduce high cardinality in categorical features using clustering or counts
Generate sub features from time features such as 'month', 'weekday', 'is_month_end', 'is_month_start' & 'hour'
Group features by calculating min, max, mean, median & sd of similar features
10) Make nonliner features (polynomial, sin, cos & tan)
11) Scales & Power Transform (zscore, minmax, yeo-johnson, quantile, maxabs, robust), including option to transform target variable
12) Apply binning to continuous variables when numeric features are provided as a list
13) Detect & remove outliers using isolation forest, knn and PCA
14) Apply clusters to segment entire data
15) One Hot / Dummy encoding
16) Remove special characters from column names such as commas, square brackets etc to make it compatible with JSON dependent models
17) Feature Selection through Random Forest, LightGBM and Pearson Correlation
18) Fix multicollinearity
19) Feature Interaction (DFS): multiply, divide, add and subtract features
20) Apply dimension reduction techniques such as pca_liner, pca_kernal, incremental, tsne
except for pca_liner, all other methods only take the number of components (as an integer), i.e. no variance explanation method is available
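A minimal, hedged sketch of invoking this pipeline builder with only a handful of the documented parameters, leaving everything else at its default; the dataframe and target column name are illustrative assumptions, and what exactly is returned is not described in this docstring, so the sketch simply captures it:

from pycaret.preprocess import Preprocess_Path_One

processed = Preprocess_Path_One(train_data = train_df, target_variable = 'target', ml_usecase = 'regression', apply_binning = False, remove_outliers = False)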
-
pycaret.preprocess.
Preprocess_Path_Two
(train_data, ml_usecase=None, test_data=None, categorical_features=[], numerical_features=[], time_features=[], features_todrop=[], display_types=False, imputation_type='simple imputer', numeric_imputation_strategy='mean', categorical_imputation_strategy='not_available', apply_zero_nearZero_variance=False, club_rare_levels=False, rara_level_threshold_percentage=0.05, apply_untrained_levels_treatment=False, untrained_levels_treatment_method='least frequent', apply_cardinality_reduction=False, cardinal_method='cluster', cardinal_features=[], apply_ordinal_encoding=False, ordinal_columns_and_categories={}, apply_binning=False, features_to_binn=[], apply_grouping=False, group_name=[], features_to_group_ListofList=[[]], scale_data=False, scaling_method='zscore', Power_transform_data=False, Power_transform_method='quantile', remove_outliers=False, outlier_contamination_percentage=0.01, outlier_methods=['pca', 'iso', 'knn'], remove_multicollinearity=False, maximum_correlation_between_features=0.9, remove_perfect_collinearity=False, apply_pca=False, pca_method='pca_liner', pca_variance_retained_or_number_of_components=0.99, random_state=42)¶ - Follwoing preprocess steps are taken:
THIS IS BUILT FOR UNSUPERVISED LEARNING
Auto infer data types
Impute (simple or with surrogate columns)
Ordinal Encoder
Drop categorical variables that have zero variance or near zero variance
Club categorical variables levels together as a new level (other_infrequent) that are rare / at the bottom 5% of the variable distribution
Club unseen levels in test dataset with most/least frequent levels in train dataset
Reduce high cardinality in categorical features using clustering or counts
Generate sub features from time features such as 'month', 'weekday', 'is_month_end', 'is_month_start' & 'hour'
Group features by calculating min, max, mean, median & sd of similar features
10) Scales & Power Transform (zscore, minmax, yeo-johnson, quantile, maxabs, robust), including option to transform target variable
11) Apply binning to continuous variables when numeric features are provided as a list
12) Detect & remove outliers using isolation forest, knn and PCA
13) One Hot / Dummy encoding
14) Remove special characters from column names such as commas, square brackets etc to make it compatible with JSON dependent models
15) Fix multicollinearity
16) Apply dimension reduction techniques such as pca_liner, pca_kernal, incremental, tsne
except for pca_liner, all other methods only take the number of components (as an integer), i.e. no variance explanation method is available
-
class
pycaret.preprocess.
Reduce_Cardinality_with_Clustering
(target_variable, catagorical_feature=[], check_clusters_upto=30, random_state=42)¶ Reduces the levels of a categorical column / its cardinality through clustering
Highly recommended to run the DataTypes_Auto_infer class first Args:
target_variable: target variable (integer or numerical only)
catagorical_feature: list of features on which clustering is to be applied / cardinality to be reduced
check_clusters_upto: to determine the optimum number of kmeans clusters, set the upper limit of clusters
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Reduce_Cardinality_with_Counts
(catagorical_feature=[])¶ Reduces the levels of a categorical column by replacing levels with their count & converting objects into float Args:
catagorical_feature: list of features on which clustering is to be applied
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Reduce_Dimensions_For_Supervised_Path
(target, method='pca_liner', variance_retained_or_number_of_components=0.99, random_state=42)¶ Takes DF, returns same DF with different types of dimensionality reduction models (pca_liner, pca_kernal, tsne, pls, incremental)
except pca_liner, every other method takes an integer as the number of components
only takes numeric variables (float & One Hot Encoded)
it is intended to solve supervised ML use cases, such as classification / regression
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Remove_100
(target)¶ Takes DF, returns data frame while removing features that are perfectly correlated (dropping one)
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Scaling_and_Power_transformation
(target, function_to_apply='zscore', random_state_quantile=42)¶ - Given a data set, applies Min Max, Standard Scaler or Power Transformation (yeo-johnson) - it is recommended to run Define_dataTypes first - ignores target variable
- Args:
target: string, name of the target variable
function_to_apply: string, default 'zscore' (standard scaler), all others {'minmax','yj','quantile','robust','maxabs'} (min max, yeo-johnson & quantile power transformation, robust and MaxAbs scaler)
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Simple_Imputer
(numeric_strategy, categorical_strategy, target_variable)¶ - Imputes all types of data (numerical, categorical & Time).
Highly recommended to run the Define_dataTypes class first. Numerical values can be imputed with mean or median. Categorical missing values will be replaced with "Other". Time values are imputed with the most frequent value. Ignores target (y) variable Args:
Numeric_strategy: string, all possible values {'mean','median'}
categorical_strategy: string, all possible values {'not_available','most frequent'}
target: string, name of the target variable
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
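A minimal usage sketch under the Args above; the dataframe and column names are illustrative assumptions:

from pycaret.preprocess import Simple_Imputer

imputer = Simple_Imputer(numeric_strategy = 'mean', categorical_strategy = 'not_available', target_variable = 'target')
train_imputed = imputer.fit_transform(train_df)  # numeric NaNs -> column mean, categorical NaNs -> 'not_available'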
-
class
pycaret.preprocess.
Surrogate_Imputer
(numeric_strategy, categorical_strategy, target_variable)¶ - Imputes a feature with a surrogate column (numerical, categorical & Time).
Highly recommended to run Define_dataTypes class first
it is also recommended to only apply this to features where it makes business sense to create a surrogate column
feature name has to be provided
only able to handle one feature at a time
Numerical values can be imputed with mean or median
categorical missing values will be replaced with "Other"
Time values are imputed with the most frequent value
Ignores target (y) variable
- Args:
feature_name: string, provide the feature's name
feature_type: string, all possible values {'numeric','categorical','date'}
strategy: string, all possible values {'mean','median','not_available','most frequent'}
target: string, name of the target variable
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
-
class
pycaret.preprocess.
Target_Transformation
(target, function_to_apply='bc')¶ Applies Power Transformation (yeo-johnson, Box-Cox) to the target variable (Applicable to Regression only) - 'bc' for Box-Cox & 'yj' for yeo-johnson, default is Box-Cox
if the target contains negative / zero values, yeo-johnson is automatically selected
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
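A minimal usage sketch; the dataframe and target column name are illustrative assumptions:

from pycaret.preprocess import Target_Transformation

tt = Target_Transformation(target = 'target', function_to_apply = 'bc')  # falls back to yeo-johnson if the target has zero/negative values
train_transformed = tt.fit_transform(train_df)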
-
class
pycaret.preprocess.
Zroe_NearZero_Variance
(target, threshold_1=0.1, threshold_2=20)¶ it eliminates the features having zero variance
it eliminates the features having near zero variance
Near zero variance is determined by: 1) the count of unique points divided by the total length of the feature has to be lower than a pre specified threshold, and 2) the most common point (count) divided by the second most common point (count) in the feature is greater than a pre specified threshold. Once both conditions are met, the feature is dropped
- Ignores target variable
- Args:
threshold_1: float (between 0.0 to 1.0), default is .10
threshold_2: int (between 1 to 100), default is 20
target variable: string, name of the target variable
-
fit_transform
(dataset, y=None)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
numpy array of shape [n_samples, n_features_new]
Datasets¶
-
pycaret.datasets.
get_data
(dataset, save_copy=False, profile=False)¶ This function loads sample datasets that are available in the pycaret git repository. The full list of available datasets and their descriptions can be viewed by calling index.
data = get_data(‘index’)
This will display the list of available datasets that can be loaded using the get_data() function. For example, to load the credit dataset:
credit = get_data(‘credit’)
- Parameters
dataset (string) –
index value of dataset.
save_copy (bool, default = False) –
When set to true, it saves a copy of the dataset to your local active directory.
profile (bool, default = False) –
If set to true, a data profile for Exploratory Data Analysis will be displayed in an interactive HTML report.
Returns –
-------- –
DataFrame (Pandas dataframe is returned.) –
---------- –
Warnings –
--------- –
- Use of get_data() requires internet connection.
Clustering¶
Classification¶
Arules¶
-
pycaret.arules.
create_model
(metric='confidence', threshold=0.5, min_support=0.05, round=4)¶ This function creates an association rules model using data and identifiers passed at setup stage. This function internally transforms the data for association rule mining.
setup() function must be called before using create_model()
from pycaret.datasets import get_data
france = get_data('france')
experiment_name = setup(data = france, transaction_id = 'InvoiceNo', item_id = 'ProductName')
This will return a dataframe containing rules sorted by the metric param.
- Parameters
metric (string, default = 'confidence') –
Metric to evaluate if a rule is of interest. Default is set to confidence. Other available metrics include 'support', 'lift', 'leverage', 'conviction'. These metrics are computed as follows:
- support(A->C) = support(A+C) [aka 'support'], range: [0, 1]
- confidence(A->C) = support(A+C) / support(A), range: [0, 1]
- lift(A->C) = confidence(A->C) / support(C), range: [0, inf]
- leverage(A->C) = support(A->C) - support(A)*support(C), range: [-1, 1]
- conviction = [1 - support(C)] / [1 - confidence(A->C)], range: [0, inf]
threshold (float, default = 0.5) –
Minimal threshold for the evaluation metric, via the metric parameter, to decide whether a candidate rule is of interest.
min_support (float, default = 0.05) –
A float between 0 and 1 for minimum support of the itemsets returned. The support is computed as the fraction `transactions_where_item(s)_occur / total_transactions`.
round (integer, default = 4) –
Number of decimal places metrics in score grid will be rounded to.
Returns –
-------- –
DataFrame – Dataframe containing rules of interest with all metrics including antecedents, consequents, antecedent support, consequent support, support, confidence, lift, leverage, conviction.
Warnings –
--------- –
- Setting low values for min_support may increase training time.
-
pycaret.arules.
get_rules
(data, transaction_id, item_id, ignore_items=None, metric='confidence', threshold=0.5, min_support=0.05)¶ Magic function to get Association Rules in Power Query / Power BI.
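A minimal sketch of calling this from a Power Query / Power BI Python script, using the signature shown above; the dataframe and column names follow the 'france' example used elsewhere in this section and are illustrative:

from pycaret.arules import get_rules
rules = get_rules(data = france, transaction_id = 'InvoiceNo', item_id = 'ProductName', metric = 'confidence', threshold = 0.5)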
-
pycaret.arules.
plot_model
(model, plot='2d')¶ This function takes a model dataframe returned by create_model() function. ‘2d’ and ‘3d’ plots are available.
rule1 = create_model(metric = 'confidence', threshold = 0.7, min_support = 0.05)
plot_model(rule1, plot = '2d')
plot_model(rule1, plot = '3d')
- Parameters
model (DataFrame, default = none) –
DataFrame returned by the trained model using create_model().
plot (string, default = '2d') –
Enter abbreviation of type of plot. The current list of plots supported are:

Name                                   Abbreviated String
---------                              ------------------
Support, Confidence and Lift (2d)      '2d'
Support, Confidence and Lift (3d)      '3d'
-
pycaret.arules.
setup
(data, transaction_id, item_id, ignore_items=None, session_id=None)¶ This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes three mandatory parameters: (i) dataframe {array-like, sparse matrix}, (ii) transaction_id param identifying basket and (iii) item_id param used to create rules. These three params are normally found in any transactional dataset. pycaret will internally convert the dataframe into a sparse matrix which is required for association rules mining.
from pycaret.datasets import get_data
france = get_data('france')
experiment_name = setup(data = france, transaction_id = 'InvoiceNo', item_id = 'ProductName')
- Parameters
data ({array-like, sparse matrix}) –
shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features.
transaction_id (string) –
Name of the column representing the transaction id. This will be used to pivot the matrix.
item_id (string) –
Name of the column used for creation of rules. Normally, this will be the variable of interest.
ignore_items (list, default = None) –
list of strings to be ignored when considering rule mining.
session_id (int, default = None) –
If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
Returns –
-------- –
info grid –
----------- –
environment – This function returns various outputs that are stored in a variable as a tuple. They are used by other functions in pycaret.
Warnings –
--------- –
None –
Anomaly¶
-
pycaret.anomaly.
assign_model
(model, transformation=False, score=True, verbose=True)¶ This function flags each data point in the dataset passed during the setup stage as either outlier or inlier (1 = outlier, 0 = inlier) using the trained model object passed as the model param. create_model() function must be called before using assign_model().
This function returns a dataframe with an Outlier flag (1 = outlier, 0 = inlier) and decision score, when score is set to True.
from pycaret.datasets import get_data
anomaly = get_data('anomaly')
experiment_name = setup(data = anomaly, normalize = True)
knn = create_model('knn')
knn_df = assign_model(knn)
This will return a dataframe with inferred outliers using trained model.
- Parameters
model (trained model object, default = None) –
transformation (bool, default = False) –
When set to True, assigned outliers are returned on the transformed dataset instead of the original dataset passed during setup().
score (Boolean, default = True) –
The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the model is fitted. If set to False, it will only return the flag (1 = outlier, 0 = inlier).
verbose (Boolean, default = True) –
Status update is not printed when verbose is set to False.
Returns –
-------- –
dataframe (Returns a dataframe with inferred outliers using a trained model.) –
--------- –
Warnings –
--------- –
None –
-
pycaret.anomaly.
create_model
(model=None, fraction=0.05, verbose=True)¶ This function creates a model on the dataset passed as a data param during the setup stage. setup() function must be called before using create_model().
This function returns a trained model object.
from pycaret.datasets import get_data
anomaly = get_data('anomaly')
experiment_name = setup(data = anomaly, normalize = True)
knn = create_model('knn')
This will return a trained k-Nearest Neighbors model.
- Parameters
model (trained model object) –
Enter abbreviated string of the model class. List of available models supported:

Model                                 Abbreviated String   Original Implementation
---------                             ------------------   -----------------------
Angle-base Outlier Detection          'abod'               pyod.models.abod.ABOD
Isolation Forest                      'iforest'            module-pyod.models.iforest
Clustering-Based Local Outlier        'cluster'            pyod.models.cblof
Connectivity-Based Outlier Factor     'cof'                module-pyod.models.cof
Histogram-based Outlier Detection     'histogram'          module-pyod.models.hbos
k-Nearest Neighbors Detector          'knn'                module-pyod.models.knn
Local Outlier Factor                  'lof'                module-pyod.models.lof
One-class SVM detector                'svm'                module-pyod.models.ocsvm
Principal Component Analysis          'pca'                module-pyod.models.pca
Minimum Covariance Determinant        'mcd'                module-pyod.models.mcd
Subspace Outlier Detection            'sod'                module-pyod.models.sod
Stochastic Outlier Selection          'sos'                module-pyod.models.sos

fraction (float, default = 0.05) –
The percentage / proportion of outliers in the dataset.
verbose (Boolean, default = True) –
Status update is not printed when verbose is set to False.
Returns –
-------- –
model –
------ –
Warnings –
--------- –
None –
-
pycaret.anomaly.
deploy_model
(model, model_name, authentication, platform='aws')¶ (In Preview)
This function deploys the transformation pipeline and trained model object for production use. The platform of deployment can be defined under the platform param along with the applicable authentication tokens which are passed as a dictionary to the authentication param.
from pycaret.datasets import get_data
anomaly = get_data('anomaly')
experiment_name = setup(data = anomaly, normalize = True)
knn = create_model('knn')
deploy_model(model = knn, model_name = 'deploy_knn', platform = 'aws', authentication = {'bucket' : 'pycaret-test'})
This will deploy the model on an AWS S3 account under bucket 'pycaret-test'
Before deploying a model to an AWS S3 (‘aws’), environment variables must be configured using the command line interface. To configure AWS env. variables, type aws configure in your python command line. The following information is required which can be generated using the Identity and Access Management (IAM) portal of your amazon console account:
AWS Access Key ID
AWS Secret Key Access
Default Region Name (can be seen under Global settings on your AWS console)
Default output format (must be left blank)
- Parameters
model (object) –
A trained model object should be passed as an estimator.
model_name (string) –
Name of model to be passed as a string.
authentication (dict) –
dictionary of applicable authentication tokens. When platform = 'aws': {'bucket' : 'Name of Bucket on S3'}
platform (string, default = 'aws') –
Name of platform for deployment. Current available options are: 'aws'.
Returns –
-------- –
Success Message –
Warnings –
--------- –
None –
-
pycaret.anomaly.
get_outliers
(data, model=None, fraction=0.05, ignore_features=None, normalize=True, transformation=False, pca=False, pca_components=0.99, ignore_low_variance=False, combine_rare_levels=False, rare_level_threshold=0.1, remove_multicollinearity=False, multicollinearity_threshold=0.9)¶ Magic function to get outliers in Power Query / Power BI.
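A minimal sketch of calling this from a Power Query / Power BI Python script, using the signature shown above; the dataframe name 'dataset' (as provided by Power BI) is an illustrative assumption:

from pycaret.anomaly import get_outliers
dataset = get_outliers(data = dataset, model = 'knn', fraction = 0.05)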
-
pycaret.anomaly.
load_experiment
(experiment_name)¶ This function loads a previously saved experiment from the current active directory into current python environment. Load object must be a pickle file.
saved_experiment = load_experiment(‘experiment_23122019’)
This will load the entire experiment pipeline into the object saved_experiment. The experiment file must be in current directory.
- Parameters
experiment_name (string, default = none) –
Name of pickle file to be passed as a string.
Returns –
-------- –
Information Grid containing details of saved objects in experiment pipeline. –
Warnings –
--------- –
None –
-
pycaret.anomaly.
load_model
(model_name, platform=None, authentication=None, verbose=True)¶ This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.
saved_knn = load_model(‘knn_model_23122019’)
This will load the previously saved model into the saved_knn variable. The file must be in the current directory.
- Parameters
model_name (string, default = none) –
Name of pickle file to be passed as a string.
Returns –
-------- –
Success Message –
Warnings –
--------- –
None –
-
pycaret.anomaly.
plot_model
(model, plot='tsne', feature=None)¶ This function takes a trained model object and returns a plot on the dataset passed during setup stage. This function internally calls assign_model before generating a plot.
from pycaret.datasets import get_data
anomaly = get_data('anomaly')
experiment_name = setup(data = anomaly, normalize = True)
knn = create_model('knn')
plot_model(knn)
- Parameters
model (object) –
A trained model object can be passed. Model must be created using create_model().
plot (string, default = 'tsne') –
Enter abbreviation of type of plot. The current list of plots supported are:

Name                           Abbreviated String
---------                      ------------------
t-SNE (3d) Dimension Plot      'tsne'
UMAP Dimensionality Plot       'umap'

feature (string, default = None) –
feature column is used as a hoverover tooltip. By default, the first column of the dataset is chosen as hoverover tooltip, when no feature is passed.
Returns –
-------- –
Visual Plot –
------------ –
Warnings –
--------- –
None –
-
pycaret.anomaly.
predict_model
(model, data, platform=None, authentication=None)¶ This function is used to predict new data using a trained model. It requires a trained model object created using one of the functions in pycaret that returns a trained model object. New data must be passed to the data param as a pandas Dataframe.
from pycaret.datasets import get_data
anomaly = get_data('anomaly')
experiment_name = setup(data = anomaly)
knn = create_model('knn')
knn_predictions = predict_model(model = knn, data = anomaly)
- Parameters
model (object / string, default = None) –
When the model is passed as a string, load_model() is called internally to load the pickle file from the active directory or cloud platform when the platform param is passed.
data ({array-like, sparse matrix}) –
shape (n_samples, n_features) where n_samples is the number of samples and n_features is the number of features. All features used during training must be present in the new dataset.
Returns –
-------- –
grid (info) –
---------- –
Warnings –
--------- –
Models that donot support 'predict' function cannot be used in predict_model() (-) –
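A minimal scoring sketch; the "unseen" slice here is simply a resample of the same dataset, used only to illustrate that every column used during setup() must be present:
from pycaret.datasets import get_data
from pycaret.anomaly import setup, create_model, predict_model
anomaly = get_data('anomaly')
exp = setup(data = anomaly)
knn = create_model('knn')
# score new data; all columns used during setup() must be present
new_data = anomaly.sample(50, random_state = 42)
knn_predictions = predict_model(model = knn, data = new_data)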
-
pycaret.anomaly.
save_experiment
(experiment_name=None)¶ This function saves the entire experiment into the current active directory. All outputs using pycaret are internally saved into a binary list which is pickled when save_experiment() is used.
save_experiment()
This will save the entire experiment into the current active directory. By default, the name of the experiment will use the session_id generated during setup(). To use a custom name, a string must be passed to the experiment_name param. For example:
save_experiment(‘experiment_23122019’)
- Parameters
experiment_name (string, default = None) – Name of the pickle file to be passed as a string.
- Returns
Success message
- Warnings
None
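A short sketch saving a full experiment under a custom name (the name is arbitrary; omit it to fall back to the session_id generated during setup):
from pycaret.datasets import get_data
from pycaret.anomaly import setup, create_model, save_experiment
anomaly = get_data('anomaly')
exp = setup(data = anomaly, normalize = True)
knn = create_model('knn')
# pickle the entire experiment into the current working directory
save_experiment('experiment_23122019')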
-
pycaret.anomaly.
save_model
(model, model_name, verbose=True)¶ This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.
from pycaret.datasets import get_data
anomaly = get_data(‘anomaly’)
experiment_name = setup(data = anomaly, normalize = True)
knn = create_model(‘knn’)
save_model(knn, ‘knn_model_23122019’)
This will save the transformation pipeline and model as a binary pickle file in the current directory.
- Parameters
model (object, default = None) – A trained model object should be passed.
model_name (string, default = None) – Name of the pickle file to be passed as a string.
- Returns
Success message
- Warnings
None
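A minimal sketch with a different detector; the file name is arbitrary and the pickle is written to the current working directory:
from pycaret.datasets import get_data
from pycaret.anomaly import setup, create_model, save_model
anomaly = get_data('anomaly')
exp = setup(data = anomaly, normalize = True)
iforest = create_model('iforest')
# writes the transformation pipeline + trained model as a pickle file
save_model(iforest, 'iforest_model_23122019')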
-
pycaret.anomaly.
setup
(data, categorical_features=None, categorical_imputation='constant', ordinal_features=None, high_cardinality_features=None, numeric_features=None, numeric_imputation='mean', date_features=None, ignore_features=None, normalize=False, normalize_method='zscore', transformation=False, transformation_method='yeo-johnson', handle_unknown_categorical=True, unknown_categorical_method='least_frequent', pca=False, pca_method='linear', pca_components=None, ignore_low_variance=False, combine_rare_levels=False, rare_level_threshold=0.1, bin_numeric_features=None, remove_multicollinearity=False, multicollinearity_threshold=0.9, group_features=None, group_names=None, supervised=False, supervised_target=None, session_id=None, profile=False, verbose=True)¶ This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes one mandatory parameter: dataframe {array-like, sparse matrix}.
from pycaret.datasets import get_data
anomaly = get_data(‘anomaly’)
experiment_name = setup(data = anomaly, normalize = True)
‘anomaly’ is a pandas DataFrame.
- Parameters
data ({array-like, sparse matrix}, shape (n_samples, n_features)) – where n_samples is the number of samples and n_features is the number of features in the dataframe.
categorical_features (string, default = None) – If the inferred data types are not correct, categorical_features can be used to overwrite the inferred type. If, when running setup, the type of 'column1' is inferred as numeric instead of categorical, then this parameter can be used to overwrite the type by passing categorical_features = ['column1'].
categorical_imputation (string, default = 'constant') – If missing values are found in categorical features, they will be imputed with a constant 'not_available' value. The other available option is 'mode' which imputes the missing value using the most frequent value in the training dataset.
ordinal_features (dictionary, default = None) – When the data contains ordinal features, they must be encoded differently using the ordinal_features param. If the data has a categorical variable with values of 'low', 'medium', 'high' and it is known that low < medium < high, then it can be passed as ordinal_features = { 'column_name' : ['low', 'medium', 'high'] }. The list sequence must be in increasing order from lowest to highest.
high_cardinality_features (string, default = None) – When the data contains features with high cardinality, they can be compressed into fewer levels by passing them as a list of column names with high cardinality. Features are compressed using frequency distribution. As such, original features are replaced with the frequency distribution and converted into a numeric variable.
numeric_features (string, default = None) – If the inferred data types are not correct, numeric_features can be used to overwrite the inferred type. If, when running setup, the type of 'column1' is inferred as categorical instead of numeric, then this parameter can be used to overwrite it by passing numeric_features = ['column1'].
numeric_imputation (string, default = 'mean') – If missing values are found in numeric features, they will be imputed with the mean value of the feature. The other available option is 'median' which imputes the value using the median value in the training dataset.
date_features (string, default = None) – If the data has a DateTime column that is not automatically detected when running setup, this parameter can be used by passing date_features = 'date_column_name'. It can work with multiple date columns. Date columns are not used in modeling. Instead, feature extraction is performed and date columns are dropped from the dataset. If the date column includes a time stamp, features related to time will also be extracted.
ignore_features (string, default = None) – If any feature should be ignored for modeling, it can be passed to the param ignore_features. The ID and DateTime columns, when inferred, are automatically set to ignore for modeling.
normalize (bool, default = False) – When set to True, the feature space is transformed using the normalize_method param. Generally, linear algorithms perform better with normalized data; however, the results may vary and it is advised to run multiple experiments to evaluate the benefit of normalization.
normalize_method (string, default = 'zscore') – Defines the method to be used for normalization. By default, the normalize method is set to 'zscore'. The standard zscore is calculated as z = (x - u) / s. The other available options are:

  'minmax' : scales and translates each feature individually such that it is in the range of 0 - 1.
  'maxabs' : scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
  'robust' : scales and translates each feature according to the Interquartile range. When the dataset contains outliers, the robust scaler often gives better results.
transformation (bool, default = False) – When set to True, a power transformation is applied to make the data more normal / Gaussian-like. This is useful for modeling issues related to heteroscedasticity or other situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
transformation_method (string, default = 'yeo-johnson') – Defines the method for transformation. By default, the transformation method is set to 'yeo-johnson'. The other available option is 'quantile' transformation. Both transformations transform the feature set to follow a Gaussian-like or normal distribution. Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.
handle_unknown_categorical (bool, default = True) – When set to True, unknown categorical levels in new / unseen data are replaced by the most or least frequent level as learned in the training data. The method is defined under the unknown_categorical_method param.
unknown_categorical_method (string, default = 'least_frequent') – Method used to replace unknown categorical levels in unseen data. Method can be set to 'least_frequent' or 'most_frequent'.
pca (bool, default = False) – When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in the pca_method param. In supervised learning, pca is generally performed when dealing with a high feature space and memory is a constraint. Note that not all datasets can be decomposed efficiently using a linear PCA technique and that applying PCA may result in loss of information. As such, it is advised to run multiple experiments with different pca_methods to evaluate the impact.
pca_method (string, default = 'linear') – The 'linear' method performs Linear dimensionality reduction using Singular Value Decomposition. The other available options are:

  'kernel'      : dimensionality reduction through the use of an RBF kernel.
  'incremental' : replacement for 'linear' pca when the dataset to be decomposed is too large to fit in memory.

pca_components (int/float, default = 0.99) – Number of components to keep. If pca_components is a float, it is treated as a target percentage for information retention. When pca_components is an integer, it is treated as the number of features to be kept. pca_components must be strictly less than the original number of features in the dataset.
ignore_low_variance (bool, default = False) – When set to True, all categorical features with statistically insignificant variances are removed from the dataset. The variance is calculated using the ratio of unique values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.
combine_rare_levels (bool, default = False) – When set to True, all levels in categorical features below the threshold defined in the rare_level_threshold param are combined together as a single level. There must be at least two levels under the threshold for this to take effect. rare_level_threshold represents the percentile distribution of level frequency. Generally, this technique is applied to limit a sparse matrix caused by high numbers of levels in categorical features.
rare_level_threshold (float, default = 0.1) – Percentile distribution below which rare categories are combined. Only comes into effect when combine_rare_levels is set to True.
bin_numeric_features (list, default = None) – When a list of numeric features is passed, they are transformed into categorical features using KMeans, where values in each bin have the same nearest center of a 1D k-means cluster. The number of clusters is determined based on the 'sturges' method. It is only optimal for gaussian data and underestimates the number of bins for large non-gaussian datasets.
remove_multicollinearity (bool, default = False) – When set to True, the variables with inter-correlations higher than the threshold defined under the multicollinearity_threshold param are dropped. When two features are highly correlated with each other, the feature with the higher average correlation in the feature space is dropped.
multicollinearity_threshold (float, default = 0.9) – Threshold used for dropping the correlated features. Only comes into effect when remove_multicollinearity is set to True.
group_features (list or list of list, default = None) – When a dataset contains features that have related characteristics, the group_features param can be used for statistical feature extraction. For example, if a dataset has numeric features that are related with each other (i.e. 'Col1', 'Col2', 'Col3'), a list containing the column names can be passed under group_features to extract statistical information such as the mean, median, mode and standard deviation.
group_names (list, default = None) – When group_features is passed, a name of the group can be passed into the group_names param as a list containing strings. The length of a group_names list must equal the length of group_features. When the length doesn't match or the name is not passed, new features are sequentially named such as group_1, group_2, etc.
supervised (bool, default = False) – When set to True, the supervised_target column is ignored for transformation. This param is only for internal use.
supervised_target (string, default = None) – Name of the supervised_target column that will be ignored for transformation. Only applicable when the tune_model() function is used. This param is only for internal use.
session_id (int, default = None) – If None, a random seed is generated and returned in the Information grid. The unique number is then distributed as a seed in all functions used during the experiment. This can be used for later reproducibility of the entire experiment.
profile (bool, default = False) – If set to True, a data profile for Exploratory Data Analysis will be displayed in an interactive HTML report.
verbose (Boolean, default = True) – Information grid is not printed when verbose is set to False.
- Returns
info grid
environment – This function returns various outputs that are stored in a variable as a tuple. They are used by other functions in pycaret.
- Warnings
None
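A slightly fuller illustrative call combining several of the preprocessing options documented above (the parameter values are arbitrary choices for demonstration, not recommendations):
from pycaret.datasets import get_data
from pycaret.anomaly import setup
anomaly = get_data('anomaly')
exp = setup(data = anomaly,
            normalize = True,
            normalize_method = 'robust',
            transformation = True,            # 'yeo-johnson' by default
            pca = True,
            pca_method = 'linear',
            pca_components = 5,
            combine_rare_levels = True,       # no-op if no categorical columns
            rare_level_threshold = 0.05,
            remove_multicollinearity = True,
            multicollinearity_threshold = 0.9,
            session_id = 123)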
-
pycaret.anomaly.
tune_model
(model=None, supervised_target=None, method='drop', estimator=None, optimize=None, fold=10)¶ This function tunes the fraction parameter using a predefined grid with the objective of optimizing a supervised learning metric as defined in the optimize param. You can choose the supervised estimator from a large library available in pycaret. By default, supervised estimator is Linear.
This function returns the tuned model object.
from pycaret.datasets import get_data
boston = get_data(‘boston’)
experiment_name = setup(data = boston, normalize = True)
tuned_knn = tune_model(model = ‘knn’, supervised_target = ‘medv’)
This will return a tuned k-Nearest Neighbors model.
- Parameters
model (string, default = None) – Enter the abbreviated name of the model. List of available models supported:

  Model                               Abbreviated String    Original Implementation
  ---------                           ------------------    -----------------------
  Angle-base Outlier Detection        'abod'                pyod.models.abod.ABOD
  Isolation Forest                    'iforest'             module-pyod.models.iforest
  Clustering-Based Local Outlier      'cluster'             pyod.models.cblof
  Connectivity-Based Outlier Factor   'cof'                 module-pyod.models.cof
  Histogram-based Outlier Detection   'histogram'           module-pyod.models.hbos
  k-Nearest Neighbors Detector        'knn'                 module-pyod.models.knn
  Local Outlier Factor                'lof'                 module-pyod.models.lof
  One-class SVM detector              'svm'                 module-pyod.models.ocsvm
  Principal Component Analysis        'pca'                 module-pyod.models.pca
  Minimum Covariance Determinant      'mcd'                 module-pyod.models.mcd
  Subspace Outlier Detection          'sod'                 module-pyod.models.sod
  Stochastic Outlier Selection        'sos'                 module-pyod.models.sos

supervised_target (string) – Name of the target column for supervised learning.
method (string, default = 'drop') – When method is set to 'drop', it will drop the outlier rows from the training dataset of the supervised estimator; when method is set to 'surrogate', it will use the decision function and label as a feature without dropping the outliers from the training dataset.
estimator (string, default = None) –

  Estimator                      Abbreviated String    Task
  ---------                      ------------------    ---------------
  Logistic Regression            'lr'                  Classification
  K Nearest Neighbour            'knn'                 Classification
  Naive Bayes                    'nb'                  Classification
  Decision Tree                  'dt'                  Classification
  SVM (Linear)                   'svm'                 Classification
  SVM (RBF)                      'rbfsvm'              Classification
  Gaussian Process               'gpc'                 Classification
  Multi Level Perceptron         'mlp'                 Classification
  Ridge Classifier               'ridge'               Classification
  Random Forest                  'rf'                  Classification
  Quadratic Disc. Analysis       'qda'                 Classification
  AdaBoost                       'ada'                 Classification
  Gradient Boosting              'gbc'                 Classification
  Linear Disc. Analysis          'lda'                 Classification
  Extra Trees Classifier         'et'                  Classification
  Extreme Gradient Boosting      'xgboost'             Classification
  Light Gradient Boosting        'lightgbm'            Classification
  CatBoost Classifier            'catboost'            Classification
  Linear Regression              'lr'                  Regression
  Lasso Regression               'lasso'               Regression
  Ridge Regression               'ridge'               Regression
  Elastic Net                    'en'                  Regression
  Least Angle Regression         'lar'                 Regression
  Lasso Least Angle Regression   'llar'                Regression
  Orthogonal Matching Pursuit    'omp'                 Regression
  Bayesian Ridge                 'br'                  Regression
  Automatic Relevance Determ.    'ard'                 Regression
  Passive Aggressive Regressor   'par'                 Regression
  Random Sample Consensus        'ransac'              Regression
  TheilSen Regressor             'tr'                  Regression
  Huber Regressor                'huber'               Regression
  Kernel Ridge                   'kr'                  Regression
  Support Vector Machine         'svm'                 Regression
  K Neighbors Regressor          'knn'                 Regression
  Decision Tree                  'dt'                  Regression
  Random Forest                  'rf'                  Regression
  Extra Trees Regressor          'et'                  Regression
  AdaBoost Regressor             'ada'                 Regression
  Gradient Boosting              'gbr'                 Regression
  Multi Level Perceptron         'mlp'                 Regression
  Extreme Gradient Boosting      'xgboost'             Regression
  Light Gradient Boosting        'lightgbm'            Regression
  CatBoost Regressor             'catboost'            Regression

If set to None, a Linear model is used by default for both classification and regression tasks.
optimize (string, default = None) –
  For Classification tasks: Accuracy, AUC, Recall, Precision, F1, Kappa
  For Regression tasks: MAE, MSE, RMSE, R2, RMSLE, MAPE
  If set to None, the default is 'Accuracy' for classification and 'R2' for regression tasks.
fold (integer, default = 10) – Number of folds to be used in Kfold CV. Must be at least 2.
- Returns
Visual plot – optimize on the y-axis. Also, prints the best model metric.
model – trained model object with the best fraction param.
- Warnings
None
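A short sketch of tuning against a regression target with a non-default supervised estimator; ‘medv’ is the target column of the boston dataset used in the example above, and the other values are illustrative choices from the options documented here:
from pycaret.datasets import get_data
from pycaret.anomaly import setup, tune_model
boston = get_data('boston')
exp = setup(data = boston, normalize = True)
# tune the fraction param of Isolation Forest against a Random Forest
# regressor, keeping outliers as a surrogate feature instead of dropping them
tuned_iforest = tune_model(model = 'iforest',
                           supervised_target = 'medv',
                           method = 'surrogate',
                           estimator = 'rf',
                           optimize = 'R2',
                           fold = 5)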