Hyperparameter Tuning


Reading time: 25 minutes

That delightful feeling when you see your first model running successfully is incomparable. Then, to make it even better, you strive to build a perfect model and ask yourself: 'How can I improve the accuracy of the model?' and 'Is there any way to speed up the training process?' These questions become more pressing once you've built a deep neural network.

Hyperparameter tuning is one of the main tools for maximizing a model's performance, in particular its predictive accuracy.

"Hyperparameter tuning is choosing a set of optimal hyperparameters for a learning algorithm".

In order to understand this process, we first need to understand the difference between a model parameter and a model hyperparameter.

A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data. However, there is another kind of parameter that cannot be learned directly from the regular training process. These parameters express “higher-level” properties of the model, such as its complexity or how fast it should learn. They are called hyperparameters. Hyperparameters are usually fixed before the actual training process begins. Some examples of hyperparameters are:

  • Learning rate,
  • Number of clusters in k-means clustering,
  • Number of hidden layers in a deep neural network, etc.

Hyperparameters are often used in processes to help estimate model parameters and are often specified by the practitioner.
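To make the distinction concrete, here is a minimal sketch using scikit-learn's LogisticRegression on synthetic data. The estimator, the data from make_classification, and the value C=0.5 are illustrative assumptions rather than anything prescribed above. C is a hyperparameter fixed before training, while coef_ and intercept_ are model parameters estimated from the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameter: chosen by the practitioner before training starts.
clf = LogisticRegression(C=0.5, max_iter=1000)

# Model parameters: estimated from the data during fit().
clf.fit(X, y)

print("Hyperparameter C:", clf.get_params()["C"])
print("Learned coefficients:", clf.coef_)
print("Learned intercept:", clf.intercept_)
```

Changing C and refitting will generally produce different learned coefficients, which is why the choice of hyperparameters is worth tuning.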

Tuning Strategies

The process for finding the right hyperparameters is still somewhat of a dark art, and it currently involves either random search or grid search across Cartesian products of sets of hyperparameters.

There are a number of methods available for tuning hyperparameters. In this blog, I demonstrate two popular ones: the first is grid search and the second is random search.

Grid Search

Grid search takes a dictionary of all the different hyperparameters that you want to test, feeds every combination of their values through the algorithm, and reports back which combination had the highest accuracy.
Let’s consider the following example:

Suppose a machine learning model X takes hyperparameters a1, a2 and a3. In grid searching, you first define the range of values for each of the hyperparameters a1, a2 and a3. You can think of this as an array of values for each hyperparameter. The grid search technique will then construct many versions of X, one for every possible combination of the hyperparameter (a1, a2 and a3) values that you defined. This range of hyperparameter values is referred to as the grid.

Suppose you defined the grid as:
a1 = [0,1,2,3,4,5]
a2 = [10,20,30,40,50,60]
a3 = [105,110,115,120,125]

Note that the array of values you define for each hyperparameter has to be legitimate, in the sense that you cannot supply floating-point values if the hyperparameter only takes integer values.

Now, grid search will begin its process of constructing several versions of X with the grid that you just defined.

It will start with the combination [0,10,105] and end with [5,60,125]. It will go through every intermediate combination between these two, which makes grid search computationally very expensive.
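The enumeration described above can be sketched with itertools.product from the Python standard library. This is only an illustration of how the combinations are formed, not how any particular library implements it. It assumes the fifth value of a2 is meant to be 50 and that a3 contains 105 only once.

```python
from itertools import product

# Grid from the example above, assuming a2's fifth value is 50
# and a3 contains 105 only once.
grid = {
    "a1": [0, 1, 2, 3, 4, 5],
    "a2": [10, 20, 30, 40, 50, 60],
    "a3": [105, 110, 115, 120, 125],
}

# Cartesian product of the value lists: one tuple per candidate model.
combinations = list(product(*grid.values()))

print(len(combinations))   # 6 * 6 * 5 = 180
print(combinations[0])     # (0, 10, 105)
print(combinations[-1])    # (5, 60, 125)
```

Every extra hyperparameter multiplies the number of candidate models, and each candidate has to be trained and evaluated, which is where the cost comes from.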

In scikit-learn this technique is provided in the GridSearchCV class. When constructing this class you must provide a dictionary of hyperparameters to evaluate in the param_grid argument. This is a map of the model parameter name and an array of values to try.

By default, the estimator's own score method is used (accuracy for classifiers), but other scores can be specified in the scoring argument of the GridSearchCV constructor.

By default, the grid search will only use one thread. By setting the n_jobs argument in the GridSearchCV constructor to -1, the process will use all cores on your machine. Depending on your Keras backend, this may interfere with the main neural network training process.

The GridSearchCV process will then construct and evaluate one model for each combination of parameters. Cross-validation is used to evaluate each individual model. The default is 5-fold cross-validation in scikit-learn 0.22 and later (earlier versions defaulted to 3-fold), and this can be overridden by specifying the cv argument to the GridSearchCV constructor.
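As a rough sketch of how these constructor arguments fit together, the snippet below sets the scoring metric, the number of folds and the number of workers explicitly. The decision tree estimator, the synthetic data, the grid values and the choice of "f1" as the metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",   # optimize F1 instead of the estimator's default score
    cv=3,           # 3-fold cross-validation, set explicitly
    n_jobs=1,       # one worker; -1 would use all cores
)
search.fit(X, y)

# 3 * 2 = 6 combinations, each evaluated on 3 folds.
print(len(search.cv_results_["params"]), "combinations,", search.n_splits_, "folds")
```

Setting cv explicitly keeps the behaviour the same across scikit-learn versions, since the default number of folds has changed over time.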

Below is an example of defining a simple grid search:

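from sklearn.model_selection import GridSearchCV

# model is assumed to be a scikit-learn-compatible estimator (e.g. a wrapped Keras model); X and Y are the training data
param_grid = dict(epochs=[10, 20, 30])
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)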

Once completed, you can access the outcome of the grid search in the result object returned from grid.fit().
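The sketch below shows one way to read those results. Because the snippet above depends on a model defined elsewhere, this version swaps in a KNeighborsClassifier and synthetic data so that it runs on its own; both are assumptions for illustration, as is the n_neighbors grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration.
X, Y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = dict(n_neighbors=[3, 5, 7])
grid = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=param_grid)
grid_result = grid.fit(X, Y)

# Best mean cross-validated score and the combination that achieved it.
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# Mean and standard deviation of the score for every combination tried.
means = grid_result.cv_results_["mean_test_score"]
stds = grid_result.cv_results_["std_test_score"]
for mean, std, params in zip(means, stds, grid_result.cv_results_["params"]):
    print("%f (%f) with: %r" % (mean, std, params))
```

Looking at the full cv_results_ table, rather than only the best entry, helps to judge whether the winning combination is clearly better or just within the noise of the cross-validation folds.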

Random Search

The idea of random search over hyperparameters was proposed by James Bergstra and Yoshua Bengio in their paper "Random Search for Hyper-Parameter Optimization" (Journal of Machine Learning Research, 2012).

Random search differs from grid search in that you no longer provide a discrete set of values to explore for each hyperparameter; rather, you provide a statistical distribution for each hyperparameter from which values may be randomly sampled.

Before going any further, let’s understand what distribution and sampling mean:

In statistics, a distribution is an arrangement of the values of a variable showing their observed or theoretical frequency of occurrence.

Sampling is the process of choosing a representative sample from a target population and collecting data from that sample in order to understand something about the population as a whole.
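As a small illustration of these two ideas, the sketch below defines two distributions with scipy.stats and draws a few samples from each. The variable names and the particular ranges are hypothetical, chosen only to show the mechanics.

```python
from scipy.stats import randint, uniform

# Discrete uniform distribution over the integers 1..8 (upper bound 9 is exclusive).
depth_dist = randint(1, 9)

# Continuous uniform distribution on the interval [0.0, 1.0].
rate_dist = uniform(loc=0.0, scale=1.0)

# Draw five samples from each; random_state makes the draws repeatable.
depth_samples = depth_dist.rvs(size=5, random_state=0)
rate_samples = rate_dist.rvs(size=5, random_state=0)

print(depth_samples)
print(rate_samples)
```

An integer-valued hyperparameter would typically use a discrete distribution such as randint, while a real-valued one would use a continuous distribution.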

Now let's again get back to the concept of random search.

You’ll define a sampling distribution for each hyperparameter. You can also define how many iterations to run when searching for the optimal model. For each iteration, the hyperparameter values of the model are set by sampling from the defined distributions. One of the primary theoretical motivations for using random search in place of grid search is that, in most cases, hyperparameters are not equally important.

In scikit-learn this technique is provided in the RandomizedSearchCV class.
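Before turning to the ready-made class, the loop described above can be sketched by hand. The snippet uses scikit-learn's ParameterSampler to draw one value per hyperparameter in each iteration and cross_val_score to evaluate the resulting model. The decision tree, the synthetic data, the distributions and the iteration count are assumptions for illustration, and this is a simplified sketch rather than what RandomizedSearchCV does internally.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterSampler, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# One sampling distribution per hyperparameter.
param_dist = {
    "max_depth": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
}

best_score, best_params = -1.0, None

# Each iteration samples one value per hyperparameter and scores that model.
for params in ParameterSampler(param_dist, n_iter=10, random_state=0):
    model = DecisionTreeClassifier(random_state=0, **params)
    score = cross_val_score(model, X, y, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

The number of models trained is fixed by n_iter rather than by the size of a grid, so the cost of the search stays under your control as more hyperparameters are added.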

The following code illustrates how to use RandomizedSearchCV:

from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# X (feature matrix) and y (labels) are assumed to be loaded beforehand;
# X needs at least 8 columns for the max_features range below.

# Defining the hyperparameter distributions to sample from
# (randint(1, 9) draws integers from 1 to 8)
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiating Decision Tree classifier
tree = DecisionTreeClassifier()

# Instantiating RandomizedSearchCV object with 5-fold cross-validation
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Output:

Tuned Decision Tree Parameters: {'min_samples_leaf': 5, 'max_depth': 3, 'max_features': 5, 'criterion': 'gini'}
Best score is 0.7265625

Conclusion

Model parameters are estimated from data automatically, whereas model hyperparameters are set manually and are used in processes that help estimate the model parameters.

Model hyperparameters are often loosely referred to as parameters because they are the parts of the machine learning process that must be set manually and tuned.

In this article, you learned the difference between the parameters and hyperparameters of a machine learning model, and saw how grid search and random search can be used to tune hyperparameters.