# Types of Gradient Optimizers in Deep Learning

In this article, we explore the concept of gradient-based optimization and the different types of gradient optimizers used in deep learning, such as the mini-batch gradient descent optimizer.

• Machine Learning & Deep Learning
• Optimization Algorithms or Optimizers
• Gradient Descent Optimizer and its types
• Other types of Optimizers
• Summary of the details

| Aspect | Details |
| --- | --- |
| What is it? | Optimizers are used to minimize the loss function and improve the results of a machine learning or deep learning model. |
| Number of types | 8 |
| Formulated in | 1847 |
| Inventor | Augustin-Louis Cauchy |

# Machine Learning & Deep Learning

Machine learning is the process by which computers learn from data in order to behave intelligently. It uses algorithms to find patterns in collected data so that a particular task can be completed or predictions can be made. The data are processed with a supervised or unsupervised approach, depending on what the algorithm needs. The purpose is to train computers on gathered data so that they can handle tasks without human intervention. Machine learning is valued for its speed and its ability to solve complex problems that the human mind cannot easily comprehend.

Deep learning is a part of machine learning that uses a more complex, layered structure to run machine learning algorithms. This layered structure of networks is modeled on the neural networks of the human brain and is called an Artificial Neural Network (ANN). It consists of input, output, and middle layers. The middle layers are called hidden layers, and they are responsible for the complex calculations. Deep learning requires a massive amount of data to make the results as accurate as possible. To reduce the loss, that is, the error between predicted and actual values, optimizers or optimization algorithms are used in deep learning.

# Optimization Algorithms or Optimizers

Optimizers in machine learning are used to minimize the loss function and so improve the quality of the results. The settings of an optimization algorithm can be changed depending on the application. The key terms are:

1. Sample: An individual row of a dataset
2. Epoch: One complete pass of the algorithm over the training dataset; training usually runs for several epochs
3. Batch: The group of samples used for a single parameter update
4. Cost/loss function: Measures the difference between the actual and predicted values
5. Learning rate: Controls how large each parameter update is, and so how quickly the model adapts to the problem
6. Weights: Control the strength of the connection between two neurons in a deep learning model
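
As a rough illustration of how the terms in this list relate, the sketch below computes a mean squared error loss and the number of parameter updates in one epoch. The function names and the sample numbers are hypothetical and chosen only for this example; mean squared error is just one possible loss function.

```python
import math


def mean_squared_error(actual, predicted):
    """Loss: average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def updates_per_epoch(num_samples, batch_size):
    """Number of parameter updates needed for one full pass over the dataset."""
    return math.ceil(num_samples / batch_size)


# Three samples: the predictions are off by 0.0, 0.5 and 1.0.
loss = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])

# 1000 samples in batches of 32: the last batch is smaller than the rest.
steps = updates_per_epoch(num_samples=1000, batch_size=32)

print(loss, steps)
```

With a batch size equal to the dataset size there is one update per epoch; with a batch size of 1 there is one update per sample.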

Optimizers use these quantities to improve the performance of a model: they specify how the weights of the neural network should change at each update. The main types of optimizers are described in the sections below.

# Gradient Descent Optimizer and its types

Gradient descent is a commonly used algorithm for training machine learning models. It minimizes the error by updating the parameters at each iteration. It is similar to fitting a linear regression, where the error between the actual and predicted outputs is calculated to find the best-fitting line.

In gradient descent, the slope (gradient) of the loss curve is calculated using calculus. The slope is typically steep at the starting point, and as the parameters are adjusted it keeps decreasing until the lowest point of the curve is reached. The goal of the algorithm is to minimize the cost/loss function, that is, the error between actual and predicted values. In simple form, the update rule is:

g_new = g - s * f(g)

where g is the current parameter value, s is the step size (learning rate), and f(g) is the gradient of the loss at g.
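
A minimal sketch of the update rule g_new = g - s * f(g), taking f(g) to be the gradient of the loss at g. The quadratic loss (g - 3)^2, the step size and the number of steps are assumptions picked purely for illustration.

```python
def gradient_descent(grad, g, s, steps):
    """Repeatedly apply g_new = g - s * grad(g)."""
    for _ in range(steps):
        g = g - s * grad(g)
    return g


# Hypothetical loss: (g - 3)^2, whose gradient is 2 * (g - 3).
# Its minimum is at g = 3.
g_final = gradient_descent(lambda g: 2 * (g - 3), g=0.0, s=0.1, steps=100)

print(g_final)
```

Each step shrinks the distance to the minimum by a constant factor here, so the result ends up very close to 3. A step size that is too large would make the iterates overshoot instead.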

Gradient descent in its basic form, batch gradient descent (BGD), is one of the most widely used algorithms for optimizing neural networks. It is easy to understand and implement, but it has disadvantages. The cost of calculating the gradient grows with the size of the dataset, and the process can be slow because the model is updated only after the gradient over the entire dataset has been calculated. The model keeps adjusting the parameters until the cost/loss function stops decreasing, ideally at a value close to zero.

For example, if a ball is released from the rim of a bowl, it rolls down toward the center. In the same way, the gradient is computed at every step, and each iteration moves the parameters further downhill.

The method keeps updating the parameters and calculating the loss until it reaches a local minimum, a point where the gradient is zero and the algorithm can make no further progress.
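
The full-dataset update described above can be sketched as follows. This is only an illustration, assuming a one-variable linear model y = w * x + b and a mean squared error loss; the data, learning rate and epoch count are made up for the example.

```python
def batch_gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b, using the gradient over the whole dataset per update."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error, averaged over every sample.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # One parameter update per pass over the dataset.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = batch_gradient_descent(xs, ys)

print(w, b)
```

Note that every update touches all samples, which is why the cost per update grows with the size of the dataset.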

The stochastic gradient descent (SGD) algorithm addresses the problem of processing a large dataset, which slows batch gradient descent (BGD) down. It picks a single sample (or a few samples) from the dataset and updates the parameters after each one. As a result, the number of updates increases compared to BGD, although each update is far cheaper. Because each gradient is estimated from so little data, the gradients/slopes are noisy.

The samples are selected at random as training proceeds, and the loss is calculated at each iteration on the selected sample only. Because each update is cheap, SGD can often approach a minimum in less time than BGD on large datasets. Each update depends only on the current sample and the current parameters, so previous loss values do not need to be stored.
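
A minimal sketch of the idea, assuming the strict form of SGD with one randomly chosen sample per update, and the same kind of hypothetical linear model y = w * x + b with a squared error loss. The seed, learning rate and data are illustrative assumptions.

```python
import random


def stochastic_gradient_descent(xs, ys, learning_rate=0.02, epochs=500, seed=0):
    """Fit y = w * x + b, updating the parameters after every single sample."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a random order
        for i in indices:
            error = w * xs[i] + b - ys[i]
            # Gradient of the squared error of this one sample only.
            w -= learning_rate * 2 * error * xs[i]
            b -= learning_rate * 2 * error
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = stochastic_gradient_descent(xs, ys)

print(w, b)
```

The data here are noise-free, so the per-sample gradients all vanish at the same point and the estimate settles down. On real, noisy data the parameters would keep jittering around the minimum unless the learning rate is reduced over time.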

The mini-batch gradient descent optimizer combines the concepts of batch gradient descent (BGD) and stochastic gradient descent (SGD). It splits the whole dataset into subsets, called batches, and calculates the loss function on each one. The parameters are updated after each batch. It therefore needs fewer updates per epoch than SGD and far less computation per update than BGD, which in practice often makes it faster than both at finding a local minimum.

Only the current batch needs to be held in memory to calculate the error and the update, which makes the process memory efficient. The updates still produce noisy gradients, but the noise is much lower than in stochastic gradient descent. This optimizer therefore offers a balance between speed and accuracy.
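
The batching described above might look like the sketch below. It again assumes a hypothetical linear model y = w * x + b with a mean squared error loss; the batch size, seed, learning rate and data are illustrative choices only.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size=2, learning_rate=0.02,
                               epochs=1000, seed=0):
    """Fit y = w * x + b, updating the parameters once per batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # Walk through the shuffled dataset one batch at a time.
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = minibatch_gradient_descent(xs, ys)

print(w, b)
```

Setting `batch_size` to 1 recovers single-sample SGD, and setting it to the dataset size recovers batch gradient descent, which is one way to see mini-batch as the middle ground between the two.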

# Other types of Optimizers

The momentum-based gradient descent optimizer is designed to overcome the noise that the mini-batch gradient optimizer produces after each parameter update. If the noise in the updates can be reduced, the calculation time decreases. Using momentum means speeding up when successive gradients point in the same direction and slowing down when the gradients change direction. To do this, a momentum hyperparameter is added to the update at each iteration, which shortens the time needed to converge to a minimum point.
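
One common way to write the momentum idea is with a running "velocity" that accumulates past gradients. The sketch below is an assumption-laden illustration: the quadratic loss (x - 3)^2, the momentum coefficient 0.9 and the other settings are picked only for the example.

```python
def momentum_gradient_descent(grad, x, learning_rate=0.1, momentum=0.9, steps=300):
    """Gradient descent with a velocity term that accumulates past gradients."""
    velocity = 0.0
    for _ in range(steps):
        # Keep a fraction of the previous step and add the new gradient step.
        velocity = momentum * velocity - learning_rate * grad(x)
        x = x + velocity
    return x


# Hypothetical loss: (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = momentum_gradient_descent(lambda x: 2 * (x - 3), x=0.0)

print(x_final)
```

When consecutive gradients agree in sign, the velocity grows and the steps get longer; when they disagree, the terms partly cancel and the steps shrink.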

Though the momentum-based gradient optimizer is a good approach, it has a downside. How quickly it converges depends on the momentum value, and with a high momentum the updates can overshoot and miss the minimum.

To overcome this problem, the Nesterov accelerated gradient (NAG) optimizer is used. It is called the look-ahead approach. Where the momentum optimizer calculates the gradient at the current position, this algorithm first moves in the direction of the previously accumulated gradients, calculates the gradient at that look-ahead position, and then makes a correction. The model therefore slows down as it approaches the minimum point instead of overshooting it. This results in fewer iterations and saves time.
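
A minimal sketch of the look-ahead approach, using one common formulation in which the gradient is evaluated at the position the accumulated velocity would carry the parameter to. The loss (x - 3)^2 and all hyperparameter values are assumptions made for illustration.

```python
def nesterov_gradient_descent(grad, x, learning_rate=0.1, momentum=0.9, steps=300):
    """Momentum update where the gradient is taken at a look-ahead position."""
    velocity = 0.0
    for _ in range(steps):
        # Look ahead along the accumulated velocity before measuring the slope.
        lookahead = x + momentum * velocity
        # Correct the velocity using the gradient at the look-ahead point.
        velocity = momentum * velocity - learning_rate * grad(lookahead)
        x = x + velocity
    return x


# Hypothetical loss: (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = nesterov_gradient_descent(lambda x: 2 * (x - 3), x=0.0)

print(x_final)
```

The only difference from plain momentum is where `grad` is evaluated: at `lookahead` rather than at `x`.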

AdaGrad focuses on the learning rate used for training the model. The key idea is an adaptive learning rate that changes with the updates, so it does not need to be tuned manually. The learning rate for a parameter is scaled down by the accumulated history of its squared gradients, so it decreases for parameters that receive large updates and keeps decreasing as that history grows. Unfortunately, this means the learning rate can shrink massively and approach zero, at which point learning effectively stops. On the other hand, AdaGrad can make faster progress early in training than a fixed learning rate.
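
The adaptive learning rate can be sketched for a single parameter as below. This is an illustration under assumptions: the loss (x - 3)^2, the base learning rate and the small constant `eps` (added to avoid division by zero) are choices made for this example.

```python
import math


def adagrad(grad, x, learning_rate=1.0, steps=500, eps=1e-8):
    """Scale each step by the root of the accumulated squared gradients."""
    accumulated = 0.0  # running sum of squared gradients
    for _ in range(steps):
        g = grad(x)
        accumulated += g * g
        # The effective learning rate shrinks as the accumulated sum grows.
        x -= learning_rate * g / (math.sqrt(accumulated) + eps)
    return x


# Hypothetical loss: (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = adagrad(lambda x: 2 * (x - 3), x=0.0)

print(x_final)
```

Because `accumulated` never decreases, the effective learning rate can only go down, which is the source of the slow-convergence problem mentioned above.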

Adam (Adaptive Moment Estimation) keeps an exponentially decaying average of past gradients, similar to momentum-based gradient descent, in addition to an exponentially decaying average of past squared gradients, as AdaDelta does. Both averages are used together to scale each update on the way to the minimum. It has modest memory requirements and works efficiently with large datasets. In practice it often converges faster than the other optimizers described here.
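
The description above matches the Adam optimizer. As a minimal sketch, assuming that is the algorithm meant, the update for a single parameter keeps one running average of gradients and one of squared gradients. The loss (x - 3)^2 is hypothetical, and the hyperparameter values are commonly used defaults apart from the learning rate, which is raised for this small example.

```python
import math


def adam(grad, x, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Combine a running average of gradients with one of squared gradients."""
    m = 0.0  # average of past gradients (momentum-like term)
    v = 0.0  # average of past squared gradients (scaling term)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Correct the bias caused by starting both averages at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Hypothetical loss: (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = adam(lambda x: 2 * (x - 3), x=0.0)

print(x_final)
```

Unlike the AdaGrad sum, the squared-gradient average here decays over time, so the effective learning rate does not shrink toward zero.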

# Summary of the details

| Optimizer | How it works | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Batch Gradient Descent | Updates the parameters once the gradient of the entire dataset is calculated | Easy to compute, understand and implement | Can be very slow and requires a large amount of memory |
| Stochastic Gradient Descent | Instead of the entire dataset, the calculation is done on one or a few samples of data | Faster per update than BGD and takes less memory | Gradients can be noisy, and it can take many updates to find the minimum |
| Mini-batch Gradient Descent | Splits the whole dataset into subsets and updates the parameters after calculating the loss function of each subset | Faster and more efficient than SGD | With too small a learning rate the process can be very slow, and the updated gradients can be noisy |
| Momentum Based Gradient Descent | Reduces the noise of updated gradients and makes the process faster | Faster convergence and takes less memory | Computation of a new parameter at each update |
| Nesterov Accelerated Gradient | Looks ahead in the direction of past gradients, makes corrections and slowly approaches the minimum | Decreases the number of iterations and makes the process faster | Computation of a new parameter at each update |
| AdaGrad | Focuses on the learning rate, which adjusts with the updates based on the sum of past squared gradients | Learning rate changes automatically with iterations | Massive decrease in learning rate can lead to slow convergence |