In this article, we will explore the concept of gradient-based optimization and the different types of gradient optimizers used in Deep Learning, such as the Mini-batch Gradient Descent Optimizer.
Table of Contents
 Machine Learning & Deep Learning
 Optimization Algorithms or Optimizers
 Gradient Descent Optimizer and its types
 Other types of Optimizers
 Summary of the details
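Point | Gradient Optimizer
What is it? | Optimizers minimize the loss function of a Machine Learning/Deep Learning model, which in turn improves the quality of its results.
Number of types | 8
Formulated in | 1847
Inventor | Augustin-Louis Cauchy
Types | Batch Gradient Descent, Stochastic Gradient Descent, Mini-batch Gradient Descent, Momentum Based Gradient Descent, Nesterov Accelerated Gradient, AdaGrad, AdaDelta, Adam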

Machine Learning & Deep Learning
Machine learning is the process by which computers learn from data in order to perform tasks that would otherwise require human intelligence. It uses algorithms to find patterns in collected data so that a model can complete a particular task or make predictions. The data are processed with a supervised or unsupervised approach, depending on what the algorithm requires. The purpose is to train computers on gathered data so that they can handle tasks without explicit human intervention. Machine learning is valued for its speed and its ability to solve complex problems that are hard for people to work through by hand.
Deep learning is a subfield of machine learning that uses layered network structures. These layered structures are loosely modeled on the networks of neurons in the human brain and are called Artificial Neural Networks (ANNs). An ANN consists of an input layer, an output layer, and middle layers. The middle layers are called hidden layers, and they are responsible for the complex calculations. Deep learning usually requires a large amount of data to make the results as accurate as possible. To reduce the loss, that is, the error between predicted and actual values, optimizers (optimization algorithms) are used in deep learning.
Optimization Algorithms or Optimizers
Optimizers in machine learning are used to minimize the loss function and thereby improve the quality of the model's results. The behavior of an optimization algorithm depends on several terms and settings, which can be changed depending on the application:
 Sample: An individual row of a dataset
 Epoch: One full pass of the training algorithm over the dataset; training usually runs for several epochs
 Batch: The group of samples processed before the parameters are updated; its size is the batch size
 Cost/Loss function: Measures the difference between the actual and predicted values
 Learning rate: The step size, which controls how much the parameters change at each update
 Weights: The trainable parameters that control how strongly one neuron influences another in a deep learning model
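As a rough illustration of how samples, batches, and epochs relate, the sketch below counts how many parameter updates happen in one epoch for different batch sizes. The helper name `updates_per_epoch` and the dataset size of 1000 are assumptions made only for this example.

```python
import math


def updates_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of parameter updates in one full pass (epoch) over the data."""
    return math.ceil(n_samples / batch_size)


n_samples = 1000  # hypothetical dataset size

# Batch size equal to the dataset size: one update per epoch.
print(updates_per_epoch(n_samples, batch_size=n_samples))  # 1
# Batch size of one sample: one update per sample.
print(updates_per_epoch(n_samples, batch_size=1))          # 1000
# A batch size in between: the last batch is smaller than the others.
print(updates_per_epoch(n_samples, batch_size=32))         # 32
```

The three calls correspond to the batch, stochastic, and mini-batch variants of gradient descent listed in this article.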
Optimizers use these settings to improve the performance of a model. The optimizer of a neural network specifies how the weights need to change at each update. The different types of optimizers are:
 Batch Gradient Descent
 Stochastic Gradient Descent
 Mini-Batch Gradient Descent
 Momentum Based Gradient Descent
 Nesterov Accelerated Gradient
 Adagrad
 AdaDelta
 Adam
In this article, only the gradient-based optimizers will be discussed.
Gradient Descent Optimizer and its types
Gradient descent is a commonly used algorithm for training machine learning models. It minimizes the error by updating the parameters at each iteration. The idea is familiar from linear regression, where the error between the actual output and the predicted output is calculated to find the line that fits best.
In gradient descent, the slope (gradient) of the loss curve is calculated using calculus. Far from the minimum the slope is typically steep, and as the parameters are adjusted the slope keeps shrinking until the lowest point of the curve is reached. The goal of the algorithm is to minimize the cost/loss function, that is, the error between actual and predicted values. The formula is,
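parameter_new = parameter_old - step_size * gradient_of_loss(parameter_old)
In simpler form, w_new = w - s * f'(w), where f is the loss function, f' is its gradient, and s is the step size (learning rate).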
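The basic idea, moving a parameter a small step against the gradient of the loss, can be sketched on a toy problem. The loss f(w) = (w - 3)^2, the starting point, and the step size below are assumptions chosen only for illustration.

```python
def loss(w: float) -> float:
    """Toy loss with its minimum at w = 3 (an assumed example)."""
    return (w - 3.0) ** 2


def grad(w: float) -> float:
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w: float, step_size: float, n_steps: int) -> float:
    """Repeatedly step against the gradient: w_new = w - step_size * grad(w)."""
    for _ in range(n_steps):
        w = w - step_size * grad(w)
    return w


w_final = gradient_descent(w=0.0, step_size=0.1, n_steps=100)
print(w_final, loss(w_final))  # w_final ends up very close to 3.0
```

With this step size, the distance to the minimum shrinks by a constant factor at every step; a much larger step size would overshoot and could diverge.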
Types of gradient optimizers are:
1. Batch Gradient Descent Optimizer
Batch gradient descent is the most basic form of gradient descent and one of the most common starting points for optimizing neural networks. It is easy to understand and easy to implement, but it has its disadvantages. The cost of calculating the gradient grows with the size of the dataset, and the process can be slow because the model is updated only once the gradient over the entire dataset has been calculated. The model keeps adjusting its parameters until the cost/loss function stops decreasing, ideally at a value close to zero.
For example, if a ball is set loose from the top of a bowl, it will roll down to its center. In the same way, the gradient is recomputed at every iteration and the parameters take a step downhill.
The method keeps updating the parameters and calculating the loss value until it finds a local minimum. A local minimum is a point where the gradient is zero, so the algorithm cannot move any further from it.
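A minimal sketch of batch gradient descent, assuming a linear model y = w * x + b with a mean squared error loss. The toy data, learning rate, and epoch count are assumptions for illustration, not values from this article.

```python
def batch_gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by minimizing mean squared error on the full dataset."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # The gradient is accumulated over every sample before a single update.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # toy data generated from y = 2x + 1

w, b = batch_gradient_descent(xs, ys)
print(w, b)  # w approaches 2.0 and b approaches 1.0
```

Each epoch performs exactly one parameter update, which is why this variant becomes slow when the dataset is large.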
2. Stochastic Gradient Descent Optimizer
The stochastic gradient descent (SGD) algorithm tackles the problem of processing a large dataset without slowing the model down. Instead of the whole dataset, it takes a single sample (or just a few samples) and updates the parameters right away. As a result, each update is much cheaper than in BGD, but many more updates are needed to pass over the data. Because each gradient is estimated from so little data, the gradients/slopes are noisy.
The process selects the samples randomly as it proceeds, and the loss is calculated only for the selected samples at each iteration. Progress toward a minimum therefore often starts sooner than with BGD, although the noisy path may need many iterations to settle. Since each update uses only the current sample, the gradients of the other samples do not need to be held in memory.
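The following sketch shows one way stochastic gradient descent could look for a linear model y = w * x + b. It assumes exactly one randomly chosen sample per update, and the toy data, learning rate, and seed are illustrative assumptions.

```python
import random


def stochastic_gradient_descent(xs, ys, learning_rate=0.02, epochs=1000, seed=0):
    """Fit y = w * x + b, updating the parameters after every single sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order
        for i in indices:
            # Gradient of the squared error of one sample only.
            error = (w * xs[i] + b) - ys[i]
            w -= learning_rate * 2.0 * error * xs[i]
            b -= learning_rate * 2.0 * error
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # noise-free toy data from y = 2x + 1

w, b = stochastic_gradient_descent(xs, ys)
print(w, b)  # close to 2.0 and 1.0
```

The toy data contain no noise, so the run settles near the exact solution; on real data the parameters would keep jittering around a minimum unless the learning rate is reduced over time.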
3. Mini-batch Gradient Descent Optimizer
Mini-batch gradient descent combines the concepts of BGD and SGD. It splits the whole dataset into subsets and calculates the loss function on one subset at a time. These subsets are called batches (or mini-batches), and the parameters are updated after each of them. It needs fewer updates than SGD to cover the data, and each update is cheaper than in BGD. In practice this often makes it faster than both batch and stochastic gradient descent.
Memory only has to hold the current batch and the values needed to calculate its error, which makes the process efficient. The updates still produce noisy gradients, but the noise is much lower than in stochastic gradient descent. Thus, this optimizer offers a balance between speed and accuracy.
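A minimal sketch of mini-batch gradient descent for a linear model y = w * x + b. The batch size of 2, the toy data, the learning rate, and the seed are assumptions chosen for illustration.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size=2, learning_rate=0.02,
                               epochs=1000, seed=0):
    """Fit y = w * x + b, updating the parameters once per mini-batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # Walk over the shuffled data in chunks of batch_size samples.
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Average the gradient over the samples of this batch only.
            grad_w = (2.0 / len(batch)) * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_b = (2.0 / len(batch)) * sum(errors)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]  # noise-free toy data from y = 2x + 1

w, b = minibatch_gradient_descent(xs, ys)
print(w, b)  # close to 2.0 and 1.0
```

Setting `batch_size` to the dataset size turns this sketch into batch gradient descent, and setting it to 1 turns it into stochastic gradient descent.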
Other types of Optimizers
1. Momentum Based Gradient Optimizer
This algorithm dampens the noise that the mini-batch gradient optimizer shows after each parameter update. With less noise in the updates, fewer iterations are wasted and convergence takes less time. Using momentum means keeping a running accumulation of past gradients: the steps grow when successive gradients point in the same direction and shrink when the gradients keep changing direction. A momentum hyperparameter (often written as beta or gamma) is added to the update rule to control how much of the previous update carries over, which speeds up convergence, that is, the time needed to find a minimum point.
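To make the effect of momentum concrete, here is a small sketch that compares plain gradient descent with a momentum variant on the toy loss f(w) = (w - 3)^2. The loss, the learning rate of 0.01, and the momentum value of 0.9 are assumptions chosen for illustration, and the velocity formulation shown is only one of several common ways to write the update.

```python
def grad(w: float) -> float:
    """Gradient of the assumed toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def plain_gd(w: float, learning_rate: float, n_steps: int) -> float:
    for _ in range(n_steps):
        w -= learning_rate * grad(w)
    return w


def momentum_gd(w: float, learning_rate: float, beta: float, n_steps: int) -> float:
    velocity = 0.0
    for _ in range(n_steps):
        # Accumulate past gradients; beta controls how much history is kept.
        velocity = beta * velocity + grad(w)
        w -= learning_rate * velocity
    return w


w_plain = plain_gd(0.0, learning_rate=0.01, n_steps=200)
w_momentum = momentum_gd(0.0, learning_rate=0.01, beta=0.9, n_steps=200)
print(abs(w_plain - 3.0), abs(w_momentum - 3.0))
```

With this small learning rate, the momentum run ends much closer to the minimum after the same number of steps, because consecutive gradients point the same way and the steps build up.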
2. Nesterov Accelerated Gradient Optimizer
Though the momentum based gradient optimizer is a good approach, it has its downside. How it converges depends on the momentum value. With a high momentum, the process can overshoot and miss the minimum.
To overcome this problem, the Nesterov accelerated gradient optimizer is used; it is called the look-ahead approach. Where the momentum optimizer computes the gradient at the current position, this algorithm first moves in the direction of the previously accumulated gradients, computes the gradient at that look-ahead position, and then makes a correction. The model therefore slows down as it approaches the minimum point instead of overshooting it. This can result in fewer iterations and save time.
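The look-ahead step can be sketched as follows, again on the assumed toy loss f(w) = (w - 3)^2. The learning rate and momentum value are illustrative assumptions, and this is one common formulation of the Nesterov update rather than the only one.

```python
def grad(w: float) -> float:
    """Gradient of the assumed toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def nesterov_gd(w: float, learning_rate: float, beta: float, n_steps: int) -> float:
    velocity = 0.0
    for _ in range(n_steps):
        # Look ahead to where the accumulated velocity would carry the parameter.
        lookahead = w + beta * velocity
        # Correct the velocity using the gradient at the look-ahead point.
        velocity = beta * velocity - learning_rate * grad(lookahead)
        w += velocity
    return w


w_final = nesterov_gd(0.0, learning_rate=0.01, beta=0.9, n_steps=200)
print(abs(w_final - 3.0))  # small distance to the minimum at w = 3
```

The only difference from the plain momentum update is where the gradient is evaluated: at the look-ahead position instead of the current one.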
3. AdaGrad
AdaGrad focuses on the learning rate used for training. The key idea is an adaptive learning rate for each parameter that changes with the updates, so there is less need to tune it manually. The learning rate is divided by the square root of the accumulated sum of squared gradients, so parameters that receive large gradients get smaller steps. Unfortunately, this accumulated history only keeps growing, so the learning rate can shrink massively and approach zero at some point. Before that point, AdaGrad can still make comparatively fast progress.
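A minimal sketch of the AdaGrad update for a single parameter, using the assumed toy loss f(w) = (w - 3)^2. The base learning rate of 1.0 and the small constant `eps` are illustrative assumptions. The function also records the effective learning rate at each step to show that it never grows.

```python
import math


def grad(w: float) -> float:
    """Gradient of the assumed toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def adagrad(w: float, learning_rate: float = 1.0, eps: float = 1e-8,
            n_steps: int = 200):
    grad_sq_sum = 0.0      # accumulated sum of squared gradients
    effective_rates = []   # learning rate actually applied at each step
    for _ in range(n_steps):
        g = grad(w)
        grad_sq_sum += g * g
        effective_rate = learning_rate / (math.sqrt(grad_sq_sum) + eps)
        effective_rates.append(effective_rate)
        w -= effective_rate * g
    return w, effective_rates


w_final, rates = adagrad(0.0)
print(w_final)              # close to 3.0
print(rates[0], rates[-1])  # the effective rate at the end is smaller
```

Because `grad_sq_sum` can only increase, the effective learning rate can only decrease, which is the shrinking behavior described above.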
4. AdaDelta
AdaDelta is an improvement that addresses the shrinking learning rate of the AdaGrad optimizer. Instead of summing all past squared gradients, AdaDelta keeps a decaying average of the squared gradients, which behaves like an average over a limited window of recent gradients. Based on this, it tunes the learning rate. There is no need to set an initial learning rate in AdaDelta, because the step size is scaled by a decaying average of past squared updates. When recent gradients are small relative to recent updates, the step size increases; in the opposite case, the step size is reduced, and the weights are updated accordingly.
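The sketch below shows one way the AdaDelta update could be written for a single parameter, using the assumed toy loss f(w) = (w - 3)^2. The decay rate `rho`, the constant `eps`, and the step count are illustrative assumptions. Note that no learning rate appears in the update.

```python
import math


def loss(w: float) -> float:
    """Assumed toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w: float) -> float:
    return 2.0 * (w - 3.0)


def adadelta(w: float, rho: float = 0.95, eps: float = 1e-6,
             n_steps: int = 500) -> float:
    avg_sq_grad = 0.0    # decaying average of squared gradients
    avg_sq_update = 0.0  # decaying average of squared parameter updates
    for _ in range(n_steps):
        g = grad(w)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        # The step is scaled by the ratio of the two running averages.
        update = -math.sqrt(avg_sq_update + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_update = rho * avg_sq_update + (1.0 - rho) * update * update
        w += update
    return w


w_start = 0.0
w_final = adadelta(w_start)
print(loss(w_start), loss(w_final))  # the loss after the run is lower
```

The first steps are tiny because the average of past updates starts at zero, so this sketch moves toward the minimum slowly at first and then picks up speed.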
5. Adam
Similar to momentum based gradient descent, this algorithm keeps a decaying average of past gradients, and similar to AdaDelta, it also keeps a decaying average of past squared gradients. Both averages are used to compute each update in this approach. Its memory requirements are modest (two extra values per parameter), and it works efficiently with large datasets. In practice, this algorithm often converges faster than the other optimizers.
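A minimal sketch of the Adam update for a single parameter, on the assumed toy loss f(w) = (w - 3)^2. The values `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are commonly used defaults; the learning rate of 0.1 and the step count are assumptions for this example.

```python
import math


def grad(w: float) -> float:
    """Gradient of the assumed toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def adam(w: float, learning_rate: float = 0.1, beta1: float = 0.9,
         beta2: float = 0.999, eps: float = 1e-8, n_steps: int = 1000) -> float:
    m = 0.0  # decaying average of past gradients
    v = 0.0  # decaying average of past squared gradients
    for t in range(1, n_steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_final = adam(0.0)
print(abs(w_final - 3.0))  # small distance to the minimum at w = 3
```

The average of gradients plays the role of momentum, while the average of squared gradients scales the step size in the way AdaGrad and AdaDelta do.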
Summary of the details
Optimizer | Details | Advantages | Issues
Batch Gradient Descent | Updates the parameters once the gradient of the entire dataset is calculated | Easy to compute, understand and implement | Can be very slow and requires large memory
Stochastic Gradient Descent | Instead of the entire dataset, the calculation is done on one or a few samples of data | Each update is faster than in BGD and takes less memory | Gradients can be noisy, so many iterations may be needed to settle at a minimum
Mini-batch Gradient Descent | Splits the whole dataset into subsets, and parameters are updated after calculating the loss function of each subset | Less noisy than SGD and cheaper per update than BGD | With too small a learning rate the process can be very slow, and the gradients can still be noisy
Momentum Based Gradient Descent | Reduces the noise of updated gradients and makes the process faster | Faster convergence with little extra memory | A momentum hyperparameter must be set, and an extra velocity term is computed at each update
Nesterov Accelerated Gradient | Moves in the direction of past gradients first, then makes a correction and approaches the minimum more carefully | Decreases the number of iterations and makes the process faster | A momentum hyperparameter must be set, and an extra velocity term is computed at each update
AdaGrad | Adjusts the learning rate according to the sum of past squared gradients | Learning rate changes automatically with iterations | Massive decrease in learning rate can lead to slow convergence
AdaDelta | Adjusts learning rate based on the average of past squared gradients | Learning rate does not decrease massively | Computation cost can be high
Adam | Computation is based on both the average of past gradients and the average of past squared gradients | Often faster than the others | Computation cost can be high
With this article at OpenGenus, you should now have a good understanding of the different types of Gradient Optimizers in Deep Learning.