What is a Gradient Boosting Machine in ML? That is the first question that needs to be answered to a beginner to Machine Learning.
Index:
This article at OpenGenus is based on the following flow:
 Ensemble Algorithms  What are they?
 What is Gradient Boosting Machines?
 Introduction to gradient Boosting
 What is gradient in gradient boosting?
 What are hyperparameters of GBM?
 Gradient Descend Optimisation  an overview
 A basic GBM model implementation using skitilearn
 GBM from scratch implementation
 Conclusion and application
Before Learning about GBM we need to learn what Ensemble Algorithms are given that GBM is one example of its application.
Ensemble Algorithms  What are they?
Ensemble algorithms in machine learning are techniques that combine the predictions of multiple base models to create a more accurate and robust predictive model. The idea behind ensemble methods is that by combining the outputs of several models, the resulting ensemble can achieve better performance than individual models, by leveraging the strengths of different models and mitigating their weaknesses.
Ensemble algorithms are widely used in machine learning because they can improve the accuracy, stability, and generalization of predictive models. Ensemble methods are particularly effective in situations where a single model may not be sufficient to capture the complexity of the data or where different models may perform well on different subsets of the data. Ensemble methods can be applied to a wide range of machine learning tasks, including classification, regression, and anomaly detection.
There are several popular ensemble algorithms in machine learning, including:

Bagging: Bagging (Bootstrap Aggregating) is an ensemble technique that involves training multiple instances of the same base model on different subsets of the training data, obtained by randomly sampling with replacement (bootstrapping). The predictions of the base models are then combined, usually by averaging, to produce the final prediction.

Boosting: Boosting is an ensemble technique that combines the predictions of multiple base models in a weighted manner, where the weight of each model depends on its performance on the training data. Boosting algorithms, such as Gradient Boosting Machines (GBM) and AdaBoost, sequentially train a series of base models, with each subsequent model focusing on correcting the mistakes made by the previous models.

Stacking: Stacking (or Stacked Generalization) is an ensemble technique that involves training multiple base models on the same training data, and then using their predictions as inputs to train a metamodel, also known as a metaclassifier or metaregressor. The metamodel then makes the final prediction based on the combined outputs of the base models.

Random Forest: Random Forest is an ensemble technique that combines multiple decision trees, where each tree is trained on a random subset of the training data and uses a random subset of the features. The predictions of the individual trees are then combined, typically by averaging, to produce the final prediction.
Ensemble methods can significantly improve the performance of machine learning models by reducing overfitting, increasing prediction accuracy, and improving model robustness. However, they may also require careful tuning of hyperparameters and may be computationally expensive, as they involve training multiple base models.
What is Gradient Boosting Machines?
Introduction to gradient Boosting
Gradient Boosting Machines (GBM) are a type of machine learning ensemble algorithm that combines multiple weak learning models, typically decision trees, in order to create a more accurate and robust predictive model. GBM belongs to the family of boosting algorithms, where the main idea is to sequentially train a series of base models in a way that each subsequent model focuses on correcting the mistakes made by the previous models.
GBM has become a popular and powerful technique for various machine learning tasks, including classification, regression, and ranking. It is known for its ability to handle complex interactions and nonlinear relationships in the data, and its ability to handle missing values and outliers effectively. However, GBM can be computationally expensive and may require careful tuning of hyperparameters to achieve optimal performance.
What is 'Gradient' in Gradient Boosting Machines
The term "gradient" in GBM refers to the fact that the algorithm uses gradient descent optimization technique to minimize the errors made by the base models. During the training process, GBM adjusts the weights of the training samples to give more importance to the samples that were misclassified by the previous models. This allows GBM to gradually learn the patterns in the data and make increasingly accurate predictions.
In the context of optimization techniques used in machine learning, "gradient descent" refers to the method of updating the model parameters iteratively in the direction of the negative gradient of the loss function with respect to those parameters.
The gradient is the vector of partial derivatives of the loss function with respect to each parameter, and it indicates the direction and magnitude of the steepest increase in the loss function. By taking the negative gradient, we obtain the direction of the steepest decrease in the loss function. Therefore, gradient descent seeks to update the model parameters in a way that moves towards the direction of the negative gradient, with the goal of reaching the optimal values of the parameters that minimize the loss function.
The "descent" part of "gradient descent" refers to the fact that the algorithm moves in the direction of decreasing values of the loss function. By updating the parameters in the direction of the negative gradient, the algorithm tries to descend down the loss function surface towards the local or global minimum, where the loss function reaches its smallest value. This process is repeated iteratively until a convergence criterion is met, typically when the change in the loss function or the parameter values falls below a certain threshold.
Gradient descent is a fundamental optimization technique used in many machine learning algorithms, including linear regression, logistic regression, support vector machines, neural networks, and gradient boosting machines, among others. It allows the model to learn optimal parameter values from training data in order to make accurate predictions on unseen data.
HyperParameters of GBM
GBM has several key components, including the loss function, the base model (often decision trees), the learning rate, and the number of iterations (or boosting rounds). The loss function quantifies the difference between the predicted values and the actual values, and GBM iteratively minimizes this loss function. The base model is typically a shallow decision tree, called a "weak learner", which is trained on a subset of the data. The learning rate determines the contribution of each base model to the final prediction, and the number of iterations specifies how many base models are sequentially trained.
Gradient Boosting Machines (GBM) are a type of ensemble algorithm that consists of multiple hyperparameters that can be tuned to optimize the performance of the model. Some common hyperparameters of GBM models include:

n_estimators
: This hyperparameter specifies the number of boosting rounds or base models to be trained in the GBM ensemble. A higher value of n_estimators typically increases the model complexity and may result in better performance, but may also increase the risk of overfitting. 
learning_rate
: Also known as the "shrinkage" parameter, this hyperparameter controls the contribution of each base model to the final prediction. A lower value of learning_rate typically results in slower convergence and requires more boosting rounds to achieve the same level of performance, while a higher value of learning_rate may result in faster convergence but may risk overshooting the optimal solution. 
max_depth
: This hyperparameter specifies the maximum depth of the decision trees used as base models in the GBM ensemble. A deeper tree can capture more complex interactions in the data but may also increase the risk of overfitting. Setting an appropriate value for max_depth is important to balance model complexity and generalization. 
min_samples_split
: This hyperparameter specifies the minimum number of samples required to split an internal node during the construction of the decision trees. A higher value of min_samples_split may result in fewer splits and shallower trees, which can help reduce overfitting, but may also reduce the model's ability to capture finegrained patterns in the data. 
subsample
: This hyperparameter specifies the fraction of samples to be used for training each base model. A value less than 1.0 introduces stochasticity into the training process by randomly subsampling the training data, which can help improve model robustness and reduce overfitting. 
loss
: This hyperparameter specifies the loss function used to quantify the difference between the predicted values and the actual values during the training process. GBM supports various loss functions, such as "deviance" for binary and multiclass classification, and "mse" (mean squared error) or "mae" (mean absolute error) for regression. 
min_samples_leaf
: This hyperparameter specifies the minimum number of samples required to be at a leaf node of the decision trees. A higher value of min_samples_leaf can help reduce overfitting by preventing small leaf nodes that may capture noise, but may also result in less expressive trees. 
max_features
: This hyperparameter specifies the maximum number of features to consider for splitting at each node during the construction of the decision trees. A lower value of max_features can reduce the risk of overfitting, but may also result in less diverse trees.
These are some of the commonly used hyperparameters in GBM models. It's important to tune these hyperparameters appropriately based on the specific dataset and problem at hand to achieve the best performance of the GBM model. Grid search, randomized search, or other hyperparameter tuning techniques can be used to find the optimal combination of hyperparameter values.
Overview of Gradient Descend Optimisation
There are several ways to optimize the gradient descent algorithm in machine learning, including:

Learning rate tuning: The learning rate is a hyperparameter that determines the step size at each iteration of the gradient descent algorithm. A larger learning rate may result in faster convergence, but it can also cause the algorithm to overshoot the optimal values and fail to converge. On the other hand, a smaller learning rate may result in slower convergence. Tuning the learning rate is an important aspect of optimizing gradient descent, and it may require experimentation to find the optimal value for a particular problem.

Batch size selection: In minibatch gradient descent and stochastic gradient descent (SGD), the gradient is estimated based on a subset of training data, rather than the entire dataset used in batch gradient descent. The batch size determines the number of samples used for estimating the gradient at each iteration. A smaller batch size may result in faster updates and convergence, but with higher variance in the gradient estimates, while a larger batch size may result in slower updates and convergence, but with lower variance in the gradient estimates. Selecting an appropriate batch size is another aspect of optimizing gradient descent, and it may depend on the dataset size, computational resources, and problem complexity.

Feature scaling: Gradient descent can be sensitive to the scale of features in the input data. If the features have different scales, the gradient updates may be dominated by the features with larger values, leading to slow convergence or getting stuck in local optima. Therefore, it is often recommended to scale the features to a similar range before applying gradient descent. Common methods for feature scaling include normalization (scaling to [0, 1] or [1, 1] range) and standardization (scaling to zero mean and unit variance).

Regularization techniques: Regularization techniques, such as L1 regularization (Lasso) and L2 regularization (Ridge), can be used to add a penalty term to the loss function in order to control the complexity of the model and prevent overfitting. Regularization can help improve the generalization performance of the model and prevent the model from becoming overly sensitive to noise in the training data.

Momentum: Momentum is a technique that uses a moving average of the previous gradient updates to help the algorithm overcome local optima and accelerate convergence. By incorporating momentum, the algorithm gains inertia and can move past small local optima, leading to faster convergence.

Adaptive learning rate: Adaptive learning rate methods, such as AdaGrad, RMSprop, and Adam, dynamically adjust the learning rate during the training process based on the history of gradient updates. These methods adaptively update the learning rate for each parameter based on its own gradient history, which can lead to faster convergence and better performance compared to a fixed learning rate.

Early stopping: Early stopping is a technique that monitors the validation or test set performance during training and stops the training process when the performance starts to degrade. This prevents overfitting and helps in finding the optimal point of convergence faster.

Parallelization: Gradient descent can be parallelized to speed up the training process by computing the gradient updates for different subsets of training data or different model parameters in parallel. This can be achieved using techniques such as minibatch parallelism, data parallelism, or model parallelism, depending on the problem and available computational resources.

Initialization: Proper initialization of model parameters can also play a role in optimizing gradient descent. Initializing the parameters with small random values or using techniques such as Xavier initialization or He initialization can help in faster convergence and prevent the model from getting stuck in poor local optima.
These are some of the common ways to optimize gradient descent in machine learning. It is important to experiment with different optimization techniques and hyperparameter values to find the best combination for a particular problem as each problem has its own unique features and outputs.
A basic GBM model implementation using skitilearn
GBM is a part of the sklearn library to support the reccurent use of GBM in the industry. A basic implementation of the same using sklearn in python can be seen below:
# Import the required libraries
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the GBM classifier with hyperparameters
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
# Train the GBM model
gbm.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = gbm.predict(X_test)
# Calculate accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
In this example, we use the GradientBoostingClassifier
class from scikitlearn to create a GBM model for classification. We load the Iris dataset, split it into training and testing sets, and then initialize the GBM classifier with hyperparameters such as n_estimators
(number of boosting rounds), learning_rate
(shrinkage parameter), and max_depth
(maximum depth of the decision trees). We then fit the GBM model to the training data using the fit()
method, make predictions on the testing data using the predict()
method, and calculate the accuracy of the model using the accuracy_score()
function from scikitlearn's metrics module.
GBM from scratch implementation
A mathematical implementation of the model can be seen using a function based approach towards creating a GBM model. A simplified version of the same can be seen below.
import numpy as np
# Define the GBM class
class GradientBoostingClassifier:
def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
self.n_estimators = n_estimators
self.learning_rate = learning_rate
self.max_depth = max_depth
self.trees = []
self.intercept = None
def fit(self, X, y):
# Initialize the predicted values with the mean of y
self.intercept = np.mean(y)
y_pred = np.full(y.shape, self.intercept)
# Fit n_estimators number of decision trees
for i in range(self.n_estimators):
residual = y  y_pred # Calculate the residuals
tree = self._build_tree(X, residual, depth=0) # Build a decision tree
self.trees.append(tree) # Add the tree to the list of trees
y_pred += self.learning_rate * tree.predict(X) # Update the predicted values
def _build_tree(self, X, y, depth):
if depth == self.max_depth:
return self._make_leaf(y)
best_feature, best_value = self._find_best_split(X, y)
left_mask = X[:, best_feature] <= best_value
right_mask = ~left_mask
tree = {'feature': best_feature,
'value': best_value,
'left': self._build_tree(X[left_mask], y[left_mask], depth + 1),
'right': self._build_tree(X[right_mask], y[right_mask], depth + 1)}
return tree
def _find_best_split(self, X, y):
best_gini = np.inf
best_feature = None
best_value = None
for feature in range(X.shape[1]):
for value in np.unique(X[:, feature]):
left_mask = X[:, feature] <= value
right_mask = ~left_mask
gini = self._gini(y[left_mask]) + self._gini(y[right_mask])
if gini < best_gini:
best_gini = gini
best_feature = feature
best_value = value
return best_feature, best_value
def _gini(self, y):
probs = np.bincount(y) / len(y)
gini = 1  np.sum(probs**2)
return gini
def _make_leaf(self, y):
leaf_value = np.mean(y)
return leaf_value
def predict(self, X):
y_pred = np.full(X.shape[0], self.intercept)
for tree in self.trees:
y_pred += self.learning_rate * tree.predict(X)
return y_pred
def _predict_tree(self, x, tree):
if tree is None:
return 0
if x[tree['feature']] <= tree['value']:
return self._predict_tree(x, tree['left'])
else:
return self._predict_tree(x, tree['right'])
This implementation defines a GradientBoostingClassifier class that includes methods for fitting the model (fit()
), making predictions (predict()
), and building decision trees (_build_tree()
), as well as helper methods for finding the best split (_find_best_split()
), calculating the Gini impurity (_gini()
), and making leaf nodes (_make_leaf()
).
Conclusion
Gradient Boosting Machines (GBM) is a powerful ensemble learning technique that combines multiple weak predictive models (typically decision trees) to create a strong predictive model. GBM iteratively builds a series of trees, each one focusing on correcting the errors made by the previous tree, by adjusting the predicted values of the target variable based on the gradient of the loss function. This results in an ensemble model that is capable of making accurate predictions and capturing complex interactions among features.
GBM has numerous applications in various domains of machine learning and data science. Some of the key applications of GBM include:

Regression: GBM can be used for predicting continuous numerical values, making it suitable for applications such as predicting housing prices, stock prices, or demand forecasting.

Classification: GBM can be used for predicting discrete categorical variables, making it suitable for applications such as spam detection, fraud detection, or image recognition.

Ranking: GBM can be used for building ranking models, which are useful in applications such as search engine ranking, recommendation systems, and personalized marketing.

Anomaly detection: GBM can be used for identifying rare events or anomalies in data, making it suitable for applications such as detecting fraud, network intrusion detection, or medical anomaly detection.

Feature selection: GBM can be used for feature selection or feature importance estimation, which helps in identifying the most important features for making accurate predictions and gaining insights into the data.

Time series forecasting: GBM can be used for time series forecasting, which involves predicting future values of a variable based on its historical values. This is useful in applications such as sales forecasting, demand forecasting, and weather forecasting.

Natural language processing: GBM can be used for text classification tasks such as sentiment analysis, topic modeling, and spam detection, making it suitable for applications in the field of natural language processing.

Medical and healthcare: GBM can be used for predicting disease outcomes, identifying potential drug targets, and analyzing medical data for diagnostic purposes.

Marketing and customer analytics: GBM can be used for customer segmentation, churn prediction, recommendation systems, and personalized marketing campaigns.
GBM is a versatile and widely used machine learning technique that has shown great success in a wide range of applications across various domains. Its ability to handle complex interactions among features, handle missing data, and make accurate predictions makes it a popular choice among data scientists and machine learning practitioners.
In case of any doubts, Feel free to reach out to me at shreyassukhadeve[at]gmail.com
. Till then, Happy Coding!