PyCaret is an end-to-end, open-source machine learning library for the Python programming language. Its primary objective is to reduce the cycle time from hypothesis to insights by providing an easy-to-use, high-level, unified API. PyCaret's vision is to become the de facto standard for teaching machine learning and data science. Our strength is our easy-to-use unified interface for both supervised and unsupervised learning. It saves the time and effort that citizen data scientists, students, and researchers would otherwise spend coding, or learning to code, against different interfaces, so that they can focus on the business problem.
## Current Release
The current release is beta 0.0.28 (as of 29/01/2020). A full release is targeted for the first week of February 2020.
## Features Currently Available
As of beta 0.0.28, the following modules are generally available:
* pycaret.datasets
* pycaret.classification (binary and multiclass)
* pycaret.regression
...
...
```
pip install pycaret
```
## Quick Start
As of beta 0.0.28, the classification, regression, nlp, arules, anomaly, and clustering modules are available.
Fixes multicollinearity between predictor variables, also considering the correlation with the target variable.
Only applies to regression and two-class classification ML use cases.
Takes numerical and one-hot encoded variables only.
Args:
    threshold (float): the maximum absolute Pearson correlation tolerated between features, from 0.0 to 1.0
    target_variable (str): the target variable/column name
    correlation_with_target_threshold (float): minimum absolute correlation required between every feature and the target variable, default 1.0 (0.0 to 1.0)
    correlation_with_target_preference (float): 0.0 to 1.0, default 0.08; when choosing between a pair of features with respect to multicollinearity and correlation with the target, this gives the option to favour one measure over the other. E.g. if the value is 0.6, the correlation-with-target measure will have a higher say during the feature-selection tug of war.
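The selection rule described above, dropping from each over-correlated feature pair the member that is less correlated with the target, can be sketched in plain pandas. This is an illustrative sketch, not PyCaret's actual implementation; the function name `drop_multicollinear` and the sample data are invented for the example:

```python
import pandas as pd

def drop_multicollinear(df, target, threshold=0.9):
    """Sketch: for each feature pair whose absolute Pearson correlation
    exceeds `threshold`, drop the member less correlated with the target."""
    features = [c for c in df.columns if c != target]
    corr = df[features].corr().abs()                    # feature-feature correlations
    target_corr = df[features].corrwith(df[target]).abs()  # feature-target correlations
    to_drop = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if a in to_drop or b in to_drop:
                continue  # already resolved in an earlier pair
            if corr.loc[a, b] > threshold:
                # keep whichever feature is more correlated with the target
                to_drop.add(a if target_corr[a] < target_corr[b] else b)
    return df.drop(columns=sorted(to_drop))

df = pd.DataFrame({
    'price': [1.0, 2.0, 3.0, 4.0, 5.0],
    'price_sq': [1.0, 4.0, 9.0, 16.0, 25.0],   # nearly collinear with price
    'noise': [0.3, -1.2, 0.8, 0.1, -0.5],
    'y': [1.1, 2.1, 2.9, 4.2, 5.0],
})
reduced = drop_multicollinear(df, target='y', threshold=0.9)
```

Here `price` and `price_sq` correlate above the threshold, and `price` is the more target-correlated of the pair, so `price_sq` is dropped while `noise` survives.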
self.merge['rank_x'] = round(
    self.multicol_weight * (self.merge['avg_cor_y'] - self.merge['avg_cor_x'])
    + self.target_corr_weight * (self.merge['target_variable_x'] - self.merge['target_variable_y']),
    6,
)  # round to 6 digits
self.merge1 = self.merge  # keep a reference; delete later

## Now there will be rows where the rank is exactly zero: these are pairs whose
## correlation between features is exactly one (e.g. price and price^2),
## so in that case we can simply pick either variable.
# But since features can appear in either column, we will drop a feature from the first
# column (say 'column') only if it does not also appear in the second column.
# Both expressions below return the list of columns to drop.
# This is how it goes:

## For the portion where the correlation is exactly one:
self.one = self.merge[self.merge['rank_x'] == 0]
# This portion is subtle:
# table `one` holds every pair of variables with a correlation of 1.
# In a nutshell, we can keep one side of each pair and drop the other side;
# however, a variable can appear more than once on either side, so we run a for loop to find all pairs.
# Here it goes:
# take a list of all (but unique) variables that have a correlation of 1 with each other; we will make two copies
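The pairing logic described in the comments above, keeping one side of each perfectly correlated pair even when a variable shows up in several pairs, can be sketched as a single greedy pass over the pairs. The function name and the pair format below are illustrative, not PyCaret's:

```python
def columns_to_drop(pairs):
    """Sketch: given (a, b) pairs of columns with correlation exactly 1,
    return the set of columns to drop, keeping one representative per group.
    Correlation of exactly 1 is transitive, so any column already assigned
    to a group means its group has a kept representative."""
    seen = set()   # columns already assigned to some group
    drop = set()
    for a, b in pairs:
        if a in seen and b in seen:
            continue                    # both sides already resolved
        if a not in seen and b not in seen:
            seen.update((a, b))
            drop.add(b)                 # keep a as the group's representative
        elif a in seen:
            seen.add(b)
            drop.add(b)                 # a's group already has a kept member
        else:
            seen.add(a)
            drop.add(a)                 # b's group already has a kept member
    return drop

dropped = columns_to_drop([('price', 'price_sq'),
                           ('price_sq', 'price_cub'),
                           ('u', 'v')])
```

In this example `price` and `u` survive as group representatives, while `price_sq`, `price_cub`, and `v` are marked for dropping even though `price_sq` appears on both sides of different pairs.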