Learn to use TPOT: An AutoML Tool
Reading time: 30 minutes | Coding time: 10 minutes
In this article, I will share some of my insights into TPOT (Tree-based Pipeline Optimization Tool). I will explain the tool through the datasets I used.
What is TPOT?
TPOT stands for Tree-based Pipeline Optimization Tool. It is used to solve, or at least give a head start on, machine learning problems. It helps us explore pipeline configurations that we might not have considered for our model, and it helps us find the best algorithm for the problem we are working on. We feed it the training input and training output; it analyzes the data and tells us the best machine learning model for the purpose.
TPOT is open source and is built on top of the scikit-learn library. It can be used for both regression and classification problems; the import and usage differ slightly for each.
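As a quick illustration, the two interfaces are imported as follows (a minimal sketch; it assumes TPOT has been installed, e.g. with pip install tpot):
# TPOT exposes a scikit-learn-style estimator for each task type
from tpot import TPOTClassifier  # for classification problems
from tpot import TPOTRegressor   # for regression problems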
Working of TPOT (Using a dataset)
TPOT works like a search algorithm that looks for the best algorithm for the task at hand. The final search result depends on performance, i.e. which algorithm provides greater accuracy than the others. It also tunes hyperparameters for better performance and evaluation, and it conducts this search using genetic programming, so it cannot be considered a random search algorithm.
Now generally there are two types of TPOT:
- TPOT classifier
- TPOT regressor
We will study the working of each using a dataset.
TPOT Classifier
For this purpose, we have taken a bank dataset that contains information about the bank's credit card customers. Data cleaning was performed to get the data into the required form.
Response Variable: default payment next month
Explanatory Variables: ID, Marriage, Age, Bill_AMT1...Bill_AMT5
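A minimal loading step might look like this (the file name here is an assumption; substitute the path to your own copy of the data):
import pandas as pd
# hypothetical file name -- replace with the path to your copy of the dataset
df = pd.read_csv('default_of_credit_card_clients.csv')
print(df.shape)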
train = df.drop('default payment next month', axis=1)  # features
test = df['default payment next month']                # target
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# note: fitting the scaler on all the data before splitting leaks
# information from the test set; ideally fit on x_train only
train = sc.fit_transform(train)
from sklearn import model_selection
x_train, x_test, y_train, y_test = model_selection.train_test_split(train, test)
Implementing TPOT Classifier
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score
The default TPOTClassifier arguments:
generations=100,
population_size=100,
offspring_size=None,  # defaults to population_size
mutation_rate=0.9,
crossover_rate=0.1,
scoring='accuracy',  # default metric for classification
cv=5,
subsample=1.0,
n_jobs=1,
max_time_mins=None,
max_eval_time_mins=5,
random_state=None,
config_dict=None,
warm_start=False,
memory=None,
periodic_checkpoint_folder=None,
early_stop=None,
verbosity=0,
disable_update_check=False
Our Model:
tpot = TPOTClassifier(
generations=5,
population_size=20,
verbosity=2,
scoring='roc_auc',
random_state=42,
disable_update_check=True,
config_dict='TPOT light'
)
tpot.fit(x_train, y_train)
After running the model, we evaluate it on the test set:
tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(x_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')
Output:
AUC score: 0.6600
To inspect the best pipeline found:
print('\nBest pipeline steps:', end='\n')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
print(f'{idx}. {transform}')
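TPOT can also export the winning pipeline as a standalone Python script, which you can then edit and reuse (the file name below is arbitrary):
# write the best pipeline found to a standalone, editable script
tpot.export('tpot_best_pipeline.py')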
Further, you can use tools like GridSearchCV to fine-tune the hyperparameters of the model TPOT suggests.
[Figure: results obtained after using the algorithm suggested by TPOT]
TPOT Regressor
For this purpose, we considered a dataset from the RBI (Reserve Bank of India) website. We performed the data cleaning and got the data into the required form.
Response Variable: Growth
from sklearn.preprocessing import StandardScaler
# separate the response variable first: StandardScaler returns a NumPy
# array, so selecting a column by name would fail after scaling the frame
y = df[' Y-o-Y Growth in (7) (%)']
x = df.drop(' Y-o-Y Growth in (7) (%)', axis=1)
sc = StandardScaler()
x = sc.fit_transform(x)
from sklearn import model_selection
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y)
[Figure: correlation of each variable with the response variable]
Model Implementation
from tpot import TPOTRegressor
tpot = TPOTRegressor(
generations=5,
population_size=50,
verbosity=2,
)
tpot.fit(x_train, y_train)
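ROC AUC only applies to classification, so a regression fit is typically evaluated with a metric such as R-squared instead (a minimal sketch using scikit-learn's r2_score):
from sklearn.metrics import r2_score
# score the best pipeline found on the held-out data
preds = tpot.predict(x_test)
print(f'R^2 on the test set: {r2_score(y_test, preds):.4f}')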
After running the model, we get the search results, and using the algorithm suggested by TPOT we obtained excellent results on this dataset.
Useful Information about TPOT
There are some parameters that determine the number of pipelines TPOT will search:
- generations: int, optional (default: 100)
  Number of iterations to run the pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total (a worked example follows this list).
- population_size: int, optional (default: 100)
  Number of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline.
- offspring_size: int, optional (default: None)
  Number of offspring to produce in each GP generation. By default, offspring_size = population_size.
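For the classifier we trained above, generations=5 and population_size=20, and offspring_size defaults to population_size, so TPOT evaluated 20 + 5 x 20 = 120 pipelines in total.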
Algorithms (with their hyperparameter search ranges) included in the latest TPOT classifier configuration:
'sklearn.naive_bayes.BernoulliNB': { 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.], 'fit_prior': [True, False] },
'sklearn.naive_bayes.MultinomialNB': { 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.], 'fit_prior': [True, False] },
'sklearn.tree.DecisionTreeClassifier': { 'criterion': ["gini", "entropy"], 'max_depth': range(1, 11), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21) },
'sklearn.ensemble.ExtraTreesClassifier': { 'n_estimators': [100], 'criterion': ["gini", "entropy"], 'max_features': np.arange(0.05, 1.01, 0.05), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False] },
'sklearn.ensemble.RandomForestClassifier': { 'n_estimators': [100], 'criterion': ["gini", "entropy"], 'max_features': np.arange(0.05, 1.01, 0.05), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False] },
'sklearn.ensemble.GradientBoostingClassifier': { 'n_estimators': [100], 'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.], 'max_depth': range(1, 11), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'subsample': np.arange(0.05, 1.01, 0.05), 'max_features': np.arange(0.05, 1.01, 0.05) },
'sklearn.neighbors.KNeighborsClassifier': { 'n_neighbors': range(1, 101), 'weights': ["uniform", "distance"], 'p': [1, 2] },
'sklearn.svm.LinearSVC': { 'penalty': ["l1", "l2"], 'loss': ["hinge", "squared_hinge"], 'dual': [True, False], 'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1], 'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.] },
'sklearn.linear_model.LogisticRegression': { 'penalty': ["l1", "l2"], 'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.], 'dual': [True, False] },
'xgboost.XGBClassifier': { 'n_estimators': [100], 'max_depth': range(1, 11), 'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.], 'subsample': np.arange(0.05, 1.01, 0.05), 'min_child_weight': range(1, 21), 'nthread': [1] }
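The search space can be narrowed by passing your own dictionary through the config_dict argument (a minimal sketch with a hypothetical two-model search space):
# hypothetical reduced search space containing only two classifiers
custom_config = {
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ['gini', 'entropy'],
        'max_depth': range(1, 11)
    }
}
tpot = TPOTClassifier(generations=5, population_size=20, config_dict=custom_config)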
Limitations
- TPOT can take a very long time to finish its search. It tries many algorithms, with many hyperparameter settings, on the data we provide, which is expensive; if we provide data without any preprocessing, it takes even longer, since it must first search over preprocessing steps and then apply the algorithms.
- In some cases TPOT shows different results for the same data. This is a consequence of the randomness in its genetic-programming search and is more noticeable on complex datasets.
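Both limitations can be softened through constructor arguments: max_time_mins caps the total search time, and random_state fixes the seed so repeated runs give the same result (a sketch; the 30-minute budget is an arbitrary choice):
# cap the search at 30 minutes and fix the seed for reproducibility
tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    max_time_mins=30,   # stop the search after 30 minutes
    random_state=42,    # identical runs on identical data
)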