Basics of Gradient descent + Stochastic Gradient descent


We have explained the Basics of Gradient descent and Stochastic Gradient descent along with a simple implementation for SGD using Linear Regression.

Table of contents:

  1. Basics of Gradient descent
  2. Gradient descent algorithm (batch gradient descent)
  3. Stochastic gradient descent
  4. Stochastic Gradient descent vs Batch Gradient descent
  5. Implementation of Stochastic Gradient Descent

Basics of Gradient descent

Gradient descent is an optimization algorithm that is used in Machine/Deep Learning. The overall goal is to minimize a convex objective function by means of iterative parameter updates.

We can define it with respect to simple linear regression. We start by choosing initial values for the parameters (m and b, or equivalently θ0 and θ1). Then we compute the gradient of the objective function with respect to these parameters. The gradient tells us how the parameters should be changed to produce the largest change in the objective function.

Then we update the parameters by moving them in the negative direction of the gradient (∂/∂θn J(θ0, θ1)) - a small, local step whose size is controlled by the learning rate α - which does the job of reducing the objective function. The above step is repeated until we can no longer reduce the objective function.
This particular variant is called batch gradient descent, where all the samples are used together in every update, computed iteratively until the value of the error J(θ0, θ1) is acceptably low.

Overall, the formulas look like this:

Linear Regression: hθ(x) = θ0 + θ1x
MSE (Mean squared error): J(θ0, θ1) = (1/m) Σ_{i=1}^{m} (hθ(x^(i)) - y^(i))^2
Update rule: θn := θn - α ∂/∂θn J(θ0, θ1)   [n = 0, 1 here]
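For completeness, differentiating the MSE above with respect to each parameter gives the gradient components that the update rule relies on:

∂/∂θ0 J(θ0, θ1) = (2/m) Σ_{i=1}^{m} (hθ(x^(i)) - y^(i))
∂/∂θ1 J(θ0, θ1) = (2/m) Σ_{i=1}^{m} (hθ(x^(i)) - y^(i)) · x^(i)

These are the same factors of 2 and residuals that appear in the SGD code later in this article (with the sign flipped, since that code works with f = y - prediction).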

Gradient descent algorithm (batch gradient descent)

Steps of Gradient descent algorithm are:

  1. Load the values of X and y, and initialize the parameters θ0 and θ1.
  2. Compute the MSE for the given dataset, and calculate the new θ0 and θ1 (that is, first calculate the update for both θ0 and θ1 separately, and then apply both updates together).
  3. For a fixed number of epochs (set by the user), repeat the previous step that many times. A minimal sketch of these steps follows this list.
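As a rough illustration of these steps, below is a minimal NumPy sketch of batch gradient descent for the single-feature hypothesis hθ(x) = θ0 + θ1x. The function name and the hyperparameters lr and epochs are illustrative placeholders, not part of any library.

import numpy as np

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    # theta0 is the intercept (b), theta1 is the slope (m)
    theta0, theta1 = 0.0, 0.0
    n = len(X)  # number of training samples
    for _ in range(epochs):
        y_pred = theta0 + theta1 * X          # h(x) for every sample
        error = y_pred - y                    # (h(x) - y)
        grad0 = (2 / n) * error.sum()         # ∂J/∂θ0
        grad1 = (2 / n) * (error * X).sum()   # ∂J/∂θ1
        theta0 -= lr * grad0                  # update both parameters together
        theta1 -= lr * grad1
    return theta0, theta1

For example, calling batch_gradient_descent(np.array([1., 2., 3.]), np.array([2., 4., 6.])) should drive θ1 towards 2 and θ0 towards 0.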

Terminologies

  1. Cost function - it is used to determine the performance of a model. The lower the cost function, the better the model. Our goal here is to minimize the cost function.
  2. Optimization - the process of iteratively updating the model parameters based on the loss function, such that the cost function is minimized.
  3. Convex function - a continuous function whose value at the midpoint of every interval in its domain does not exceed the arithmetic mean of its values at the ends of the interval. For such a function, any local minimum is also the global minimum.

Why Gradient descent?

The main reason why gradient descent is used is its computational efficiency - it is cheap per iteration and therefore finds a solution quickly, even for large datasets.
Since many of the problems in machine learning are convex, gradient descent ensures that we will reach the global minimum for them.

Types of Gradient descent -

  1. Batch gradient descent
  2. Stochastic gradient descent
  3. Mini-batch gradient descent (see the note after this list)
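Mini-batch gradient descent sits between the other two: each update uses a small random subset of the data rather than a single sample or the full set. The SGD function implemented later in this article already behaves this way when its batch_size parameter is set above 1, for example:

m, b, log, mse = SGD(X, y, lr=0.005, epoch=100, batch_size=1)    # stochastic: one random sample per update
m, b, log, mse = SGD(X, y, lr=0.005, epoch=100, batch_size=32)   # mini-batch: 32 random samples per update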

Stochastic gradient descent

Earlier, we used batch gradient descent, which takes all the samples of X and y and computes the MSE over them together. Although this may produce a smooth convergence curve, it is often not feasible to work with, because of the space complexity of loading all the data at once.

When the cost function has multiple local minima, gradient descent can fail to find the global minimum if it is not initialized close to the global minimum.

In order to avoid bad local minima, we use stochastic gradient descent. The idea is to use a noisy estimate of the gradient - a random gradient, whose expected value is the true gradient.

Due to the noisy gradient, we can move in directions that are different from the true gradient. This can prevent us from getting trapped in a shallow local minimum.

In stochastic gradient descent, we process one randomly chosen training sample in every iteration. Since the parameters are updated after every single sample, each update is quite fast; however, if the size of the dataset increases, the number of iterations needed increases with it. The only disadvantage is that the solution may be less accurate, but the results will be computed faster.

Sample Implementation of SGD (NumPy and scikit-learn's mean_squared_error are assumed to be imported, as is done in the full example later in this article):

def SGD(X, y, lr=0.001, epoch=10, batch_size=1):  
    '''
    Stochastic Gradient Descent for a single feature
    '''
    # initialize parameters
    m, b = 0.5, 0.5
    # lists to store learning process
    log, mse = [], [] 
    
    for _ in range(epoch):
        
        indexes = np.random.randint(0, len(X), batch_size) # random sample
        
        Xs = np.take(X, indexes)
        ys = np.take(y, indexes)
        N = len(Xs)
        
        f = ys - (m*Xs + b)
        
        # Updating parameters m and b
        m -= lr * (-2 * Xs.dot(f).sum() / N)
        b -= lr * (-2 * f.sum() / N)
        
        log.append((m, b))
        mse.append(mean_squared_error(y, m*X+b))        
    
    return m, b, log, mse

Stochastic Gradient descent vs Batch Gradient descent

| Stochastic Gradient descent | Batch Gradient descent |
| --- | --- |
| A single training sample is used per update | The whole training set is used per update |
| Faster and uses fewer resources | Slow and resource-demanding |
| Can be used for large training sets | Not recommended for large training sets |
| Stochastic in nature | Deterministic in nature |
| Gives a good solution, but not always the optimal one | Gives the optimal solution, given sufficient time and enough epochs to converge |
| Data samples are shuffled in a random order for every epoch | No need to shuffle data, as the whole training set is used |
| Convergence is reached faster | Converges slowly |
| Can escape shallow local minima more easily | Can't escape shallow local minima that easily |

Implementation of Stochastic Gradient Descent

Importing the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error

Preparing California Housing Dataset

housing_data = fetch_california_housing()
Features = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)
Target = pd.DataFrame(housing_data.target, columns=['Target'])

Join features with target

By displaying df, we can check the DataFrame.

df = Features.join(Target)
df

Check correlation

Correlations are a way to check the relation between two different parameters. Here, we are using the Pearson correlation (default for Pandas) to check the negative, neutral as well as positive correlation.

df.corr()

Preprocessing: Removing Outliers and Scaling

We will use only one feature here, MedInc, because we will be using simple linear regression (the same formula as mentioned above).

The reason why we need to remove outliers is that some values at the extremes lie far away from most of the data. If we check the "75%" row, we see that 75% of the data falls below 4.743250 (for MedInc) and 2.647250 (for Target), while the maximum values are 15.000100 and 5.000010 respectively.

df[['MedInc', 'Target']].describe()[1:]

Here, we will keep only the rows where MedInc is less than 8 and Target is less than 3.5, removing everything above those thresholds.

# Removal of outliers for both Target and MedInc
df = df[df.Target < 3.5]
df = df[df.MedInc < 8]

We can then check the data, after removing the outliers.

df[['MedInc', 'Target']].describe()[1:]

Feature Scaling the Data

Feature scaling is essential in Machine Learning. Here we apply min-max scaling, which fits all of the data into the range 0.0 to 1.0.

def scale(x):
    x_min = x.min()
    x_max = x.max()
    return pd.Series([(i - x_min)/(x_max - x_min) for i in x])

X = scale(df.MedInc)
y = scale(df.Target)

Test feature scaling with the largest values

Here, we check that the largest value of each scaled series is equal to 1.0 (the smallest values are correspondingly 0.0).

X.max(), y.max()

Checking the correlation of price and income by means of plotting

This code block is only for visualizing the dataset as a plot.

plt.figure(figsize=(16,6))
plt.rcParams['figure.dpi'] = 227
plt.style.use('seaborn-whitegrid')  # on Matplotlib 3.6+ this style is named 'seaborn-v0_8-whitegrid'
plt.scatter(X, y, label='Data', c='#388fd8', s=6)
plt.title('Positive Correlation Between Income and House Price', fontsize=15)
plt.xlabel('Income', fontsize=12)
plt.ylabel('House Price', fontsize=12)
plt.legend(frameon=True, loc=1, fontsize=10, borderpad=.6)
plt.tick_params(direction='out', length=6, color='#a0a0a0', width=1, grid_alpha=.6)
plt.show()

Plot Regression Graph

The purpose of this function is to plot all the data points, the intermediate regression lines recorded at each iteration, and the final predicted regression line.

def plot_regression(X, y, y_pred, log=None, title="Linear Regression"):
    plt.figure(figsize=(16,6))
    plt.rcParams['figure.dpi'] = 227
    plt.scatter(X, y, label='Data', c='#388fd8', s=6)
    if log is not None:
        for i in range(len(log)):
            plt.plot(X, log[i][0]*X + log[i][1], lw=1, c='#caa727', alpha=0.35)
    plt.plot(X, y_pred, c='#ff7702', lw=3, label='Regression')
    plt.title(title, fontsize=14)
    plt.xlabel('Income', fontsize=11)
    plt.ylabel('Price', fontsize=11)
    plt.legend(frameon=True, loc=1, fontsize=10, borderpad=.6)
    plt.tick_params(direction='out', length=6, color='#a0a0a0', width=1, grid_alpha=.6)
    plt.show()

This is the Stochastic Gradient descent function, which takes X and y as inputs, along with the learning rate, the number of epochs and the batch size (the number of samples used together while computing the error).
One thing to note is that the values m and b have both been initialized to 0.5 for the sake of simplicity. The log and MSE values are stored as lists, which can then be used by plot_regression().

# Use the min-max scaled feature and target prepared above
X = scale(df.MedInc)
y = scale(df.Target)

def SGD(X, y, lr=0.001, epoch=10, batch_size=1):  
    '''
    Stochastic Gradient Descent for a single feature
    '''
    # initialize parameters
    m, b = 0.5, 0.5
    # lists to store learning process
    log, mse = [], [] 
    
    for _ in range(epoch):
        
        indexes = np.random.randint(0, len(X), batch_size) # random sample
        
        Xs = np.take(X, indexes)
        ys = np.take(y, indexes)
        N = len(Xs)
        
        f = ys - (m*Xs + b)
        
        # Updating parameters m and b
        m -= lr * (-2 * Xs.dot(f).sum() / N)
        b -= lr * (-2 * f.sum() / N)
        
        log.append((m, b))
        mse.append(mean_squared_error(y, m*X+b))        
    
    return m, b, log, mse

Finally, we will use the functions that we have defined earlier, and plot the graph. We also plot a second graph that shows how the MSE decreases with every epoch.

m, b, log, mse = SGD(X, y, lr=0.005, epoch=100, batch_size=2)
y_pred = m*X + b

print("MSE:",mean_squared_error(y, y_pred))
plot_regression(X, y, y_pred, log=log, title="Linear Regression with SGD")

plt.figure(figsize=(16,3))
plt.rcParams['figure.dpi'] = 227
plt.plot(range(len(mse)), mse)
plt.title('SGD Optimization', fontsize=14)
plt.xlabel('Epochs', fontsize=11)
plt.ylabel('MSE', fontsize=11)
plt.show()

Overall code

# %% [markdown]
# # Stochastic Gradient Descent (based on [arseniyturin/SGD-From-Scratch](https://github.com/arseniyturin/SGD-From-Scratch))
# %% [markdown]
# ## Importing the libraries

# %%
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error

# %% [markdown]
# ## Preparing California Housing Dataset

# %%
housing_data = fetch_california_housing()
Features = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)
Target = pd.DataFrame(housing_data.target, columns=['Target'])

# %% [markdown]
# ## Join features with target

# %%
df = Features.join(Target)
df

# %% [markdown]
# ## Check correlation

# %%
df.corr()

# %% [markdown]
# ## Preprocessing: Removing Outliers and Scaling

# %%
# Before removing the outliers
df[['MedInc', 'Target']].describe()[1:]


# %%
# Removal of outliers for both Target and MedInc
df = df[df.Target < 3.5]
df = df[df.MedInc < 8]


# %%
# After removing the outliers
df[['MedInc', 'Target']].describe()[1:]

# %% [markdown]
# ## Feature Scaling the Data

# %%
def scale(x):
    x_min = x.min()
    x_max = x.max()
    return pd.Series([(i - x_min)/(x_max - x_min) for i in x])

X = scale(df.MedInc)
y = scale(df.Target)

# %% [markdown]
# ## Test feature-scaling with the largest data

# %%
X.max(), y.max()

# %% [markdown]
# ## Checking the correlation of price and income by means of plotting

# %%
plt.figure(figsize=(16,6))
plt.rcParams['figure.dpi'] = 227
plt.style.use('seaborn-whitegrid')  # on Matplotlib 3.6+ this style is named 'seaborn-v0_8-whitegrid'
plt.scatter(X, y, label='Data', c='#388fd8', s=6)
plt.title('Positive Correlation Between Income and House Price', fontsize=15)
plt.xlabel('Income', fontsize=12)
plt.ylabel('House Price', fontsize=12)
plt.legend(frameon=True, loc=1, fontsize=10, borderpad=.6)
plt.tick_params(direction='out', length=6, color='#a0a0a0', width=1, grid_alpha=.6)
plt.show()

# %% [markdown]
# ## Plot Regression Graph

# %%
def plot_regression(X, y, y_pred, log=None, title="Linear Regression"):
    plt.figure(figsize=(16,6))
    plt.rcParams['figure.dpi'] = 227
    plt.scatter(X, y, label='Data', c='#388fd8', s=6)
    if log is not None:
        for i in range(len(log)):
            plt.plot(X, log[i][0]*X + log[i][1], lw=1, c='#caa727', alpha=0.35)
    plt.plot(X, y_pred, c='#ff7702', lw=3, label='Regression')
    plt.title(title, fontsize=14)
    plt.xlabel('Income', fontsize=11)
    plt.ylabel('Price', fontsize=11)
    plt.legend(frameon=True, loc=1, fontsize=10, borderpad=.6)
    plt.tick_params(direction='out', length=6, color='#a0a0a0', width=1, grid_alpha=.6)
    plt.show()


# %%
# Use the min-max scaled feature and target prepared above
X = scale(df.MedInc)
y = scale(df.Target)

def SGD(X, y, lr=0.001, epoch=10, batch_size=1):  
    '''
    Stochastic Gradient Descent for a single feature
    '''
    # initialize parameters
    m, b = 0.5, 0.5
    # lists to store learning process
    log, mse = [], [] 
    
    for _ in range(epoch):
        
        indexes = np.random.randint(0, len(X), batch_size) # random sample
        
        Xs = np.take(X, indexes)
        ys = np.take(y, indexes)
        N = len(Xs)
        
        f = ys - (m*Xs + b)
        
        # Updating parameters m and b
        m -= lr * (-2 * Xs.dot(f).sum() / N)
        b -= lr * (-2 * f.sum() / N)
        
        log.append((m, b))
        mse.append(mean_squared_error(y, m*X+b))        
    
    return m, b, log, mse


# %%
m, b, log, mse = SGD(X, y, lr=0.005, epoch=100, batch_size=2)
y_pred = m*X + b

print("MSE:",mean_squared_error(y, y_pred))
plot_regression(X, y, y_pred, log=log, title="Linear Regression with SGD")

plt.figure(figsize=(16,3))
plt.rcParams['figure.dpi'] = 227
plt.plot(range(len(mse)), mse)
plt.title('SGD Optimization', fontsize=14)
plt.xlabel('Epochs', fontsize=11)
plt.ylabel('MSE', fontsize=11)
plt.show()

Conclusion

The above example was just a simple implementation of SGD using linear regression. It could be improved further by adding abstractions for multiple linear regression, as sketched below.
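As a rough sketch of that extension (assuming X is a 2-D NumPy array of shape (n_samples, n_features) rather than a single column; the function name sgd_multi and its parameters are illustrative, not from any library):

import numpy as np

def sgd_multi(X, y, lr=0.001, epoch=10, batch_size=1):
    # X: array of shape (n_samples, n_features); y: 1-D array of targets
    n_samples, n_features = X.shape
    w = np.full(n_features, 0.5)   # one weight per feature
    b = 0.5                        # intercept
    for _ in range(epoch):
        idx = np.random.randint(0, n_samples, batch_size)   # random mini-batch
        Xs, ys = X[idx], y[idx]
        f = ys - (Xs @ w + b)                      # residuals on the mini-batch
        w -= lr * (-2 * Xs.T @ f / batch_size)     # gradient step for the weights
        b -= lr * (-2 * f.sum() / batch_size)      # gradient step for the intercept
    return w, b

The prediction then becomes y_pred = X @ w + b, and the MSE can be computed with mean_squared_error(y, y_pred) as before.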

With this article at OpenGenus, you must have the complete idea of Basics of Gradient descent + Stochastic Gradient descent. Enjoy.
