In this article, we will explore the concept of gradient optimization and the different types of gradient-based optimizers used in Deep Learning, such as the Mini-batch Gradient Descent Optimizer.
Table of Contents
- Machine Learning & Deep Learning
- Optimization Algorithms or Optimizers
- Gradient Descent Optimizer and its types
- Other types of Optimizers
- Summary of the details
|What is it?||Optimizers are algorithms that minimize a model's loss function, which improves the accuracy of a Machine Learning/Deep Learning model.|
|Number of types||8|
Machine Learning & Deep Learning
Machine learning is the process by which computers learn from data to perform tasks that normally require intelligence. It uses algorithms to find patterns in collected data in order to complete a particular task or make predictions. The collected data are processed in a supervised or unsupervised approach, depending on what the algorithm needs. The purpose of this method is to train computers on gathered data so that they can handle tasks with little or no human intervention. Machine learning is valued for its speed and its capability of solving complex problems that the human mind cannot easily comprehend.
Deep learning is a part of machine learning that uses a more complex, layered structure to process data. The layered structure of networks in deep learning is loosely modeled on the neural networks of the human brain. This is called an Artificial Neural Network (ANN). The structure consists of input, output, and middle layers. The middle layers are called hidden layers, and they are responsible for the complex calculations. Deep learning requires a massive amount of data to make the results as accurate as possible. In order to reduce the loss, that is, the error between predicted and actual values, optimizers or optimization algorithms are used in deep learning.
Optimization Algorithms or Optimizers
Optimizers in machine learning adjust a model's parameters to minimize the loss function and so improve the quality of the result. Several terms come up when describing optimization algorithms, and some of them can be tuned depending on the application. These are:
- Sample: An individual row of a dataset
- Epoch: One full pass of the algorithm over the training dataset; training usually runs for several epochs
- Batch: The group of samples used to compute one parameter update
- Cost/Loss function: Measures the difference between the actual and predicted values
- Learning rate: The step size that controls how much the parameters change in each update
- Weights: Parameters that control how strongly one neuron influences another in a deep learning model
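As a rough illustration of how sample, batch, and epoch relate, the sketch below counts how many parameter updates happen for a given batch size. The dataset size, the number of epochs, and the helper name `updates_per_epoch` are assumptions made up for this example.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one full pass over the dataset."""
    return math.ceil(n_samples / batch_size)

n_samples = 1000   # hypothetical dataset size
epochs = 5         # hypothetical number of passes over the data

# Whole dataset per update, a small batch per update, one sample per update.
for batch_size in (n_samples, 32, 1):
    per_epoch = updates_per_epoch(n_samples, batch_size)
    print(f"batch_size={batch_size}: {per_epoch} updates per epoch, "
          f"{per_epoch * epochs} updates in total")
```

A smaller batch means more updates per epoch, each one computed from less data.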
Optimizers use these quantities to improve the performance of a model. The optimizer of a neural network specifies how the weights should change after each update. The different types of optimizers are:
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-Batch Gradient Descent
- Momentum Based Gradient Descent
- Nesterov Accelerated Gradient
- AdaGrad
- AdaDelta
- Adam
In this article, only these gradient-based optimizers will be discussed.
Gradient Descent Optimizer and its types
Gradient descent is a commonly used optimization algorithm for machine learning models. It minimizes the error by updating the parameters a little with each iteration. The idea is familiar from linear regression, where the error between the actual output and the predicted output is minimized to find the line that fits best.
In gradient descent, the slope (gradient) of the loss curve is calculated using calculus. The slope is usually steep at the starting point. As the parameters are adjusted, the slope keeps reducing until it is close to zero at the lowest point of the curve. The goal of this algorithm is to minimize the cost/loss function, which measures the error between actual and predicted values. The formula is,
parameter_new = parameter_old - step_size * gradient_of_loss(parameter_old)
In simpler form, w_new = w - s * f'(w), where f is the loss function, f'(w) is its gradient at w and s is the step size (learning rate).
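The update rule can be tried out on a one-dimensional problem. The sketch below is a minimal illustration, assuming a made-up loss f(x) = (x - 3)^2 whose minimum is at x = 3; the function names and the step size are chosen only for this example. Each step moves the parameter against the gradient of the loss.

```python
def loss(x):
    # Toy loss with its minimum at x = 3 (an assumption for illustration).
    return (x - 3.0) ** 2

def grad(x):
    # Derivative of the toy loss with respect to x.
    return 2.0 * (x - 3.0)

def gradient_descent(x0, step_size=0.1, n_steps=100):
    x = x0
    for _ in range(n_steps):
        # Move the parameter against the slope of the loss.
        x = x - step_size * grad(x)
    return x

x_min = gradient_descent(x0=0.0)
print(x_min, loss(x_min))
```

With this step size the distance to the minimum shrinks by a constant factor on every step, so the result ends up very close to 3.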
Types of gradient optimizers are:
1. Batch Gradient Descent Optimizer
Batch gradient descent (BGD) is the basic form of the algorithm and one of the most popular ways to optimize neural networks. It is easy to understand and easy to implement, but it has its disadvantages. The cost of calculating the gradient grows with the size of the dataset. The process can be slow because the model is updated only once the gradient over the entire dataset has been calculated. The model keeps adjusting its parameters until the cost/loss function stops decreasing, ideally at a value close to zero.
For example, if a ball is set loose from the top of a bowl, it will roll down to its center. The gradient is recomputed at every step of the way, once per iteration.
The method keeps updating the parameters and calculating the loss value until it finds a local minimum. A local minimum is a point where the gradient is zero, so the algorithm cannot make further progress from there.
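A minimal sketch of batch gradient descent is shown below. It fits a line y = w*x + b to a tiny made-up dataset using the mean squared error. The data, learning rate, and number of epochs are assumptions chosen for illustration. Note that every update uses all samples.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def batch_gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error over the ENTIRE dataset.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Exactly one parameter update per full pass over the data.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = batch_gradient_descent(xs, ys)
print(w, b)
```

On this noise-free data the parameters approach w = 2 and b = 1.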
2. Stochastic Gradient Descent Optimizer
The stochastic gradient descent (SGD) algorithm tackles the problem of processing a large dataset without slowing the model down. Instead of the whole dataset, it takes a single sample (or, in looser usage, a few samples) and updates the parameters after each one. As a result, the number of updates increases compared to BGD, but each update is much cheaper to compute. Because every update is based on so little data, the gradients/slopes are noisy.
The process selects the samples randomly as it proceeds. The loss and its gradient are calculated for the selected sample in each iteration, so only that sample needs to be held in memory. In practice the frequent updates often reach the neighborhood of a minimum in less time than BGD, although the noise makes the parameters fluctuate around it rather than settle exactly. The previous loss values don't need to be stored.
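The sketch below illustrates stochastic gradient descent on a tiny made-up linear regression problem, updating the parameters after every single sample. The data, learning rate, epoch count, and random seed are assumptions for illustration. Because this toy data contains no noise, the updates settle on the exact line; on real data the parameters would keep fluctuating around the minimum.

```python
import random

# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def stochastic_gradient_descent(xs, ys, learning_rate=0.02, epochs=1000, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a random order
        for i in indices:
            # Prediction error on ONE sample.
            error = w * xs[i] + b - ys[i]
            # Update immediately using the gradient of that sample's squared error.
            w -= learning_rate * 2.0 * error * xs[i]
            b -= learning_rate * 2.0 * error
    return w, b

w, b = stochastic_gradient_descent(xs, ys)
print(w, b)
```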
3. Mini-batch Gradient Descent Optimizer
Mini-batch gradient descent combines the BGD and SGD concepts. It splits the whole dataset into subsets and calculates the loss function on one subset at a time. These subsets are called batches (or mini-batches). The parameters are updated after each batch. So, it needs fewer updates per epoch than SGD and far less computation per update than BGD. In practice this often makes the process faster than both batch and stochastic gradient descent.
Only the current batch has to be held in memory for training and calculating the error, which makes it more memory-efficient than BGD. The updates are still noisy in this algorithm. However, the noise is much lower than in stochastic gradient descent, because each gradient is averaged over several samples. Thus, this optimizer presents a balance between speed and accuracy.
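A minimal mini-batch sketch follows, again on a tiny made-up linear regression problem. The batch size of 2, the learning rate, the epoch count, and the seed are assumptions for illustration. The dataset is shuffled each epoch, cut into batches, and the parameters are updated once per batch using the gradient averaged over that batch.

```python
import random

# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def minibatch_gradient_descent(xs, ys, batch_size=2, learning_rate=0.02,
                               epochs=2000, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # Walk through the shuffled data one batch at a time.
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch only.
            grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

w, b = minibatch_gradient_descent(xs, ys)
print(w, b)
```

Setting `batch_size` to the dataset size gives batch gradient descent, and setting it to 1 gives stochastic gradient descent.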
Other types of Optimizers
1. Momentum Based Gradient Optimizer
This algorithm dampens the noise and oscillation that remain in mini-batch gradient descent after each parameter update. If the updates zigzag less, fewer iterations are needed to reach a minimum. Using momentum means accumulating a decaying sum of past gradients: the steps grow when successive gradients point in the same direction and shrink when the gradients keep changing direction. This adds one hyperparameter, the momentum coefficient (commonly around 0.9), which speeds up convergence, that is, the time needed to find a minimum point.
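The effect of momentum can be sketched on a made-up two-dimensional loss, f(x, y) = 0.5 * (x^2 + 10 * y^2), which is much steeper in y than in x. The loss, the starting point, and the hyperparameter values below are assumptions for illustration. The velocity terms `vx` and `vy` hold the decaying sum of past gradients.

```python
def grad(x, y):
    # Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 10 * y**2).
    return x, 10.0 * y

def plain_gd(x, y, learning_rate=0.01, n_steps=300):
    for _ in range(n_steps):
        gx, gy = grad(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y

def momentum_gd(x, y, learning_rate=0.01, beta=0.9, n_steps=300):
    vx, vy = 0.0, 0.0
    for _ in range(n_steps):
        gx, gy = grad(x, y)
        # Velocity: decaying sum of past gradients.
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= learning_rate * vx
        y -= learning_rate * vy
    return x, y

def distance_to_minimum(point):
    # The minimum of the toy loss is at (0, 0).
    x, y = point
    return (x * x + y * y) ** 0.5

plain = plain_gd(5.0, 5.0)
with_momentum = momentum_gd(5.0, 5.0)
print(distance_to_minimum(plain), distance_to_minimum(with_momentum))
```

With the same learning rate and number of steps, the momentum version ends up much closer to the minimum on this toy problem.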
2. Nesterov Accelerated Gradient Optimizer
Though the momentum based gradient optimizer is a good approach, it has its downside. How it converges depends on the momentum. With a high momentum, the accumulated velocity can carry the parameters past the minimum.
To overcome this problem, the Nesterov accelerated gradient optimizer is used. It is called the look-ahead approach. Where the momentum optimizer calculates the gradient at the current position, this algorithm first moves in the direction of the accumulated previous gradients, calculates the gradient at that look-ahead position and then makes a correction. This lets the update slow down before it overshoots as it approaches the minimum point, which often results in fewer iterations and saves time.
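The look-ahead step can be sketched as follows, using a made-up loss f(x, y) = 0.5 * (x^2 + 10 * y^2) with its minimum at (0, 0). The loss and the hyperparameter values are assumptions for illustration. The only change from plain momentum is where the gradient is evaluated.

```python
def grad(x, y):
    # Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 10 * y**2).
    return x, 10.0 * y

def nesterov_gd(x, y, learning_rate=0.01, beta=0.9, n_steps=300):
    vx, vy = 0.0, 0.0
    for _ in range(n_steps):
        # Look ahead to where the accumulated velocity is about to take us.
        look_x = x - learning_rate * beta * vx
        look_y = y - learning_rate * beta * vy
        # Evaluate the gradient at the look-ahead point, not the current one.
        gx, gy = grad(look_x, look_y)
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= learning_rate * vx
        y -= learning_rate * vy
    return x, y

x_final, y_final = nesterov_gd(5.0, 5.0)
print(x_final, y_final)
```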
3. AdaGrad Optimizer
AdaGrad focuses on the learning rate used for training the models. The key idea is an adaptive learning rate that changes according to the updates, so there is less need to tune it manually. Each parameter's learning rate is divided by the square root of its accumulated history of squared gradients, so it decreases for parameters that receive larger updates. Unfortunately, because this accumulated sum keeps growing, the learning rate can decrease massively and approach zero, at which point learning effectively stops. On the other hand, AdaGrad often makes fast progress early in training.
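A minimal AdaGrad sketch is shown below on a made-up loss f(x, y) = 0.5 * (x^2 + 10 * y^2). The loss, the base learning rate of 1.0, and the small constant `eps` are assumptions for illustration. The list `effective_rates` records the learning rate actually applied to x, so the shrinking can be observed.

```python
import math

def grad(x, y):
    # Gradient of the toy loss f(x, y) = 0.5 * (x**2 + 10 * y**2).
    return x, 10.0 * y

def adagrad(x, y, learning_rate=1.0, eps=1e-8, n_steps=500):
    sum_sq_x, sum_sq_y = 0.0, 0.0   # accumulated squared gradients
    effective_rates = []            # learning rate actually applied to x
    for _ in range(n_steps):
        gx, gy = grad(x, y)
        sum_sq_x += gx * gx
        sum_sq_y += gy * gy
        # Each parameter gets its own rate, scaled by its gradient history.
        rate_x = learning_rate / (math.sqrt(sum_sq_x) + eps)
        rate_y = learning_rate / (math.sqrt(sum_sq_y) + eps)
        x -= rate_x * gx
        y -= rate_y * gy
        effective_rates.append(rate_x)
    return (x, y), effective_rates

(x_final, y_final), effective_rates = adagrad(5.0, 5.0)
print(x_final, y_final)
print(effective_rates[0], effective_rates[-1])
```

The last effective rate is smaller than the first, because the accumulated sum of squared gradients never decreases.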
4. AdaDelta Optimizer
AdaDelta is an improvement that addresses the decaying learning rate problem of the AdaGrad optimizer. Instead of taking the sum of all accumulated squared gradients, AdaDelta takes an average of recent squared gradients and tunes the learning rate based on it. Conceptually the average is calculated over a fixed window of past squared gradients; in practice it is implemented as an exponentially decaying average, so old gradients are gradually forgotten. There is no need to set an initial learning rate in AdaDelta, because the step is scaled by a running average of past updates. The step size grows when recent updates have been large relative to recent gradients and shrinks in the opposite case, and the weights are updated accordingly.
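The sketch below shows one common formulation of the AdaDelta update, using decaying averages of squared gradients and squared steps. The toy loss f(x, y) = 0.5 * (x^2 + 10 * y^2), the decay rate `rho`, and the constant `eps` are assumptions for illustration. AdaDelta starts with very small steps, so this example only shows that the loss goes down, not that the minimum is reached.

```python
import math

def loss(x, y):
    # Toy loss with its minimum at (0, 0) (an assumption for illustration).
    return 0.5 * (x * x + 10.0 * y * y)

def grad(x, y):
    return x, 10.0 * y

def adadelta(x, y, rho=0.95, eps=1e-6, n_steps=500):
    params = [x, y]
    avg_sq_grad = [0.0, 0.0]   # decaying average of squared gradients
    avg_sq_step = [0.0, 0.0]   # decaying average of squared updates
    for _ in range(n_steps):
        grads = grad(params[0], params[1])
        for i in range(2):
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1.0 - rho) * grads[i] ** 2
            # The ratio of the two running averages replaces the learning rate.
            step = -(math.sqrt(avg_sq_step[i] + eps)
                     / math.sqrt(avg_sq_grad[i] + eps)) * grads[i]
            avg_sq_step[i] = rho * avg_sq_step[i] + (1.0 - rho) * step ** 2
            params[i] += step
    return params[0], params[1]

initial_loss = loss(5.0, 5.0)
x_final, y_final = adadelta(5.0, 5.0)
final_loss = loss(x_final, y_final)
print(initial_loss, final_loss)
```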
5. Adam Optimizer
Adam (Adaptive Moment Estimation) combines the two ideas above. Similar to momentum based gradient descent, it stores a decaying average of past gradients, and similar to AdaDelta, it also stores a decaying average of past squared gradients. Both averages are used in each update to move toward the minimum. It has modest memory requirements and works efficiently with large datasets. In practice it often converges faster than the other optimizers discussed here.
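A minimal sketch of the Adam update is given below, including the usual bias correction for the two running averages. The toy loss f(x, y) = 0.5 * (x^2 + 10 * y^2), the learning rate of 0.1, and the number of steps are assumptions for illustration; the values of `beta1`, `beta2`, and `eps` are the commonly quoted defaults.

```python
import math

def loss(x, y):
    # Toy loss with its minimum at (0, 0) (an assumption for illustration).
    return 0.5 * (x * x + 10.0 * y * y)

def grad(x, y):
    return x, 10.0 * y

def adam(x, y, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    params = [x, y]
    m = [0.0, 0.0]   # decaying average of past gradients
    v = [0.0, 0.0]   # decaying average of past squared gradients
    for t in range(1, n_steps + 1):
        grads = grad(params[0], params[1])
        for i in range(2):
            m[i] = beta1 * m[i] + (1.0 - beta1) * grads[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * grads[i] ** 2
            # Bias correction: both averages start at zero and are too small early on.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return params[0], params[1]

initial_loss = loss(5.0, 5.0)
x_final, y_final = adam(5.0, 5.0)
final_loss = loss(x_final, y_final)
print(initial_loss, final_loss)
```

The average of past gradients plays the role of momentum, while the average of past squared gradients gives each parameter its own effective step size.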
Summary of the details
|Optimizer||How it works||Advantages||Disadvantages|
|Batch Gradient Descent||Updates the parameters once the gradient of the entire dataset is calculated||Easy to compute, understand and implement||Can be very slow and requires large memory|
|Stochastic Gradient Descent||Instead of the entire dataset, each update is calculated on a single sample (or a few samples)||Each update is much cheaper than in BGD and takes less memory||Gradients are noisy, so many updates are needed and the parameters fluctuate around the minimum|
|Mini-batch Gradient Descent||Splits the whole dataset into subsets and updates the parameters after calculating the loss function on each subset||Less noisy than SGD and cheaper per update than BGD||With too small a learning rate the process can be very slow, and the gradients are still somewhat noisy|
|Momentum Based Gradient Descent||Accumulates past gradients to dampen the noise of the updates and speed up the process||Faster convergence with less oscillation||Adds a momentum term to compute and store at each update, and can overshoot the minimum when the momentum is high|
|Nesterov Accelerated Gradient||Moves in the direction of the accumulated past gradients, calculates the gradient at that look-ahead point and makes a correction||Reduces overshooting, which often decreases the number of iterations||Adds a momentum term to compute and store at each update|
|AdaGrad||Adjusts the learning rate of each parameter based on the sum of its past squared gradients||Learning rate changes automatically with iterations||Massive decrease in learning rate can lead to slow convergence|
|AdaDelta||Adjusts the learning rate based on a decaying average of past squared gradients||Learning rate does not decrease massively||More computation and stored state per update than plain gradient descent|
|Adam||Each update is based on decaying averages of both past gradients and past squared gradients||Often converges faster than the others in practice||More computation and stored state per update than plain gradient descent|
With this article at OpenGenus, you should now have a good idea of the different types of Gradient Optimizers in Deep Learning.