In this article at OpenGenus, we explain the concept of Random Forest Classification in depth, along with its implementation in Python.

Table of contents:

Introduction
Model workings
Model code using Python libraries
Model code from scratch without using ML-based Python libraries
Summary

1. Introduction

Random Forest Classification is a powerful machine learning algorithm used for classification tasks. It is an ensemble learning method that builds multiple decision trees and combines their predictions to produce a final output. Each decision tree is trained on a random subset of the data, and a random subset of features is selected for each split. This randomness reduces the risk of overfitting and makes the model more robust.

Components of a Random Forest Classifier

A Random Forest Classifier consists of the following components:

Decision trees: The basic building blocks of a Random Forest. Each decision tree is trained on a random subset of the data and a random subset of features for each split. The output of each decision tree is combined to make the final prediction.
Bagging: Random sampling of the training data with replacement. Bagging creates multiple subsets of the data, each of which is used to train a separate decision tree.
Random feature selection: Randomly selecting a subset of features at each split in a decision tree. This reduces the correlation between decision trees and improves the model's generalization performance.
Voting: Combining the predictions of multiple decision trees to make a final prediction. Different voting methods can be used, such as majority voting or weighted voting.
In the next section, we will explain how Random Forest works and its advantages over a single decision tree.

2. Model workings

How Random Forest works

Random Forest works by building a large number of decision trees and then combining their predictions to produce a final output. Each decision tree is trained on a bootstrap sample, which is a random sample of the training data drawn with replacement. Training many trees this way and aggregating their predictions is called bagging. It reduces the variance of the model and, with it, the risk of overfitting.
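
As a rough illustration of the bootstrap step described above, the following sketch draws one bootstrap sample using only the Python standard library. The helper name bootstrap_indices, the seed, and the sample size are assumptions made for this example and are not part of any library.

```python
import random

def bootstrap_indices(num_samples, rng):
    """Draw num_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(num_samples) for _ in range(num_samples)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
num_samples = 10
in_bag = bootstrap_indices(num_samples, rng)

# Rows that were never drawn are "out-of-bag" for this tree.
out_of_bag = sorted(set(range(num_samples)) - set(in_bag))

print("in-bag indices:    ", in_bag)
print("out-of-bag indices:", out_of_bag)
```

Because the indices are drawn with replacement, some rows usually appear more than once and others not at all, so each tree sees a slightly different view of the training data.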

In addition to bagging, Random Forest also uses random feature selection. At each split in a decision tree, only a random subset of features is considered. This helps to reduce the correlation between decision trees and improve the model's generalization performance.
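
Random feature selection can be sketched in a few lines. The helper candidate_features below is hypothetical, and the choice of roughly sqrt(num_features) candidates per split is an assumption (it is a common default for classification, not something the algorithm requires).

```python
import math
import random

def candidate_features(num_features, rng):
    """Pick a random subset of feature indices to consider at one split."""
    # Assumption: use about sqrt(num_features) candidates per split.
    k = max(1, int(math.sqrt(num_features)))
    return sorted(rng.sample(range(num_features), k))

rng = random.Random(42)
# Each call models a different split, so each split gets its own subset.
for split in range(3):
    print("split", split, "considers features", candidate_features(30, rng))
```

Since every split looks at a different subset, two trees are unlikely to make the same sequence of splits, even when their bootstrap samples overlap heavily.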

Once all the decision trees are built, their predictions are combined using a voting method. For classification tasks, majority voting is often used, where the most frequent class prediction is chosen. For regression tasks, the average prediction is taken.
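
The two ways of combining predictions mentioned above can be sketched as follows. The function names are illustrative only, and ties in the vote are resolved in favor of the label that was seen first, which is a simplification for this example.

```python
from collections import Counter

def majority_vote(predictions):
    """Classification: return the most frequent class label."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return sum(predictions) / len(predictions)

tree_votes = [1, 0, 1, 1, 0]       # class predicted by each of five trees
print(majority_vote(tree_votes))   # 1

tree_outputs = [2.0, 3.0, 4.0]     # value predicted by each of three trees
print(average(tree_outputs))       # 3.0
```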

Advantages of Random Forest over a single decision tree

Random Forest has several advantages over a single decision tree. Some of these advantages are:

Reduced overfitting: By using bagging and random feature selection, Random Forest reduces the risk of overfitting and improves the model's generalization performance.
Better accuracy: Random Forest often produces better accuracy than a single decision tree by combining the predictions of multiple decision trees.
Robustness: Random Forest is less sensitive to noise and outliers in the data than a single decision tree, making it more robust.
Feature importance: Random Forest can provide a measure of feature importance, which can be useful for feature selection and understanding the data.
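
To get a feel for why combining trees can improve accuracy, consider a simplified calculation. It assumes a binary problem, an odd number of trees, and trees that make their mistakes independently of each other. Real trees trained on the same data are correlated, so this is an idealized upper-bound style sketch and not a prediction of actual performance. The function name majority_accuracy is hypothetical.

```python
from math import comb

def majority_accuracy(num_trees, tree_accuracy):
    """Probability that more than half of num_trees independent voters are correct."""
    needed = num_trees // 2 + 1
    return sum(
        comb(num_trees, k) * tree_accuracy**k * (1 - tree_accuracy) ** (num_trees - k)
        for k in range(needed, num_trees + 1)
    )

# Each individual tree is assumed to be right 60% of the time.
for n in (1, 11, 101):
    print(n, "trees ->", round(majority_accuracy(n, 0.6), 3))
```

Under these assumptions the majority vote becomes more reliable as trees are added. Random feature selection matters for the same reason: the less correlated the trees are, the closer the forest gets to this idealized behavior.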

In the next section, we will see how to implement Random Forest Classification using Python libraries.

3. Model code using Python libraries

Importing libraries and loading data

Before we start coding the Random Forest Classification model, let's import the necessary libraries and load the data. For this example, we will use the breast cancer dataset from the scikit-learn library.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the data
data = load_breast_cancer()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

Training the model

Now that we have loaded the data, let's train the Random Forest Classification model using the RandomForestClassifier class from the scikit-learn library.

# Train the model
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

Here, we have set the n_estimators parameter to 100, which means that we will build 100 decision trees. We have also set the max_depth parameter to 5, which limits the maximum depth of each decision tree to 5 levels.
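
One way to check the effect of these two parameters is to inspect the fitted forest. The sketch below repeats the setup from above so that it runs on its own. It relies on the estimators_ attribute of RandomForestClassifier and the get_depth() method of the individual trees. The variable names num_trees and deepest are chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

# estimators_ is the list of fitted decision trees inside the forest.
num_trees = len(rf_model.estimators_)
deepest = max(tree.get_depth() for tree in rf_model.estimators_)

print("number of trees:", num_trees)
print("deepest tree:", deepest)
```

The number of trees should match n_estimators, and no tree should be deeper than max_depth.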

Evaluating the model

After training the model, let's evaluate its performance on the testing set using the accuracy_score function from the scikit-learn library.

# Make predictions on the testing set
y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

In this example, we have achieved an accuracy of 96.49% on the testing set.
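
Beyond accuracy, the trained forest also exposes the feature importance scores mentioned earlier. The following sketch repeats the setup so that it is self-contained and prints the five highest-ranked features using the feature_importances_ attribute. The variable name ranked and the choice of five features are assumptions for this example, and the exact scores depend on the library version and random seed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

# One impurity-based score per feature; the scores are normalized to sum to 1.
importances = rf_model.feature_importances_
ranked = sorted(
    zip(data.feature_names, importances), key=lambda pair: pair[1], reverse=True
)

for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```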

Random Forest Classification is a powerful algorithm that can handle complex data and produce accurate predictions. However, it is not always necessary to use pre-built libraries to implement this algorithm. In the next section, we will see how to implement Random Forest Classification from scratch without using any machine learning libraries.

4. Model code from scratch without using ML-based Python libraries

In this section, we will implement the Random Forest Classification algorithm from scratch without using any pre-built machine learning libraries. This will give us a better understanding of the algorithm and its inner workings.

Creating the Decision Tree

To create a decision tree, we first need to create a class for the nodes of the tree. Each node will have the following attributes:

feature_index: the index of the feature used for splitting at this node.
threshold: the threshold value for the feature used for splitting at this node.
left: the left child of this node.
right: the right child of this node.
value: the predicted value at this node.

class DecisionNode:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value
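
To show how these attributes are meant to work together, here is a small hand-built tree and a traversal function. The helper predict_one is hypothetical. The sketch assumes that only leaf nodes have value set, and that samples with a feature value below the threshold go to the left child.

```python
class DecisionNode:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

def predict_one(node, x):
    """Walk from the root to a leaf and return the leaf's value."""
    while node.value is None:  # internal nodes carry no value in this sketch
        if x[node.feature_index] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value

# A one-split tree: feature 0 below 2.5 -> class 0, otherwise class 1.
stump = DecisionNode(
    feature_index=0,
    threshold=2.5,
    left=DecisionNode(value=0),
    right=DecisionNode(value=1),
)

print(predict_one(stump, [1.0]))  # 0
print(predict_one(stump, [3.0]))  # 1
```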

Next, we can create a function to find the best split at each node. For this, we calculate the information gain of every candidate split, that is, every feature paired with every observed value of that feature as a threshold, and select the split with the highest information gain.

import numpy as np

def _best_split(X, y):
    num_features = X.shape[1]
    best_feature_index, best_threshold = None, None
    best_gain = -1

    # Try every feature and every observed value of that feature as a threshold
    for feature_index in range(num_features):
        feature_values = X[:, feature_index]
        thresholds = np.unique(feature_values)

        for threshold in thresholds:
            gain = _information_gain(y, X, feature_index, threshold)

            # Keep the split with the highest information gain seen so far
            if gain > best_gain:
                best_feature_index = feature_index
                best_threshold = threshold
                best_gain = gain

    return best_feature_index, best_threshold

Here, we have used the _information_gain function to calculate the information gain of each candidate split. This function can be defined as follows:

def _information_gain(y, X, feature_index, threshold):
    parent_entropy = _entropy(y)
    left_indices = X[:, feature_index] < threshold
    right_indices = X[:, feature_index] >= threshold
    
    num_left, num_right = np.sum(left_indices), np.sum(right_indices)
    if num_left == 0 or num_right == 0:
        return 0
    
    left_entropy = _entropy(y[left_indices])
    right_entropy = _entropy(y[right_indices])
    
    child_entropy = (num_left/len(y))*left_entropy + (num_right/len(y))*right_entropy
    return parent_entropy - child_entropy

In this function, we have calculated the entropy of the parent node and the child nodes using the _entropy function. This function can be defined as follows:

def _entropy(y):
    # Class counts -> class probabilities -> Shannon entropy in bits
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / len(y)
    entropy = -np.sum(probabilities * np.log2(probabilities))
    return entropy
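
A small worked example may help make entropy and information gain concrete. The sketch below is a dependency-free restatement of the two calculations for a single feature, using the same convention as above (values below the threshold go left). The function names and the toy data are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy of the parent minus the weighted entropy of the two children."""
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

values = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]

print(entropy(labels))                        # 1.0 (two equally likely classes)
print(information_gain(values, labels, 3.0))  # 1.0 (the split separates the classes)
print(information_gain(values, labels, 2.0))  # smaller gain: the right child is still mixed
```

A threshold of 3.0 puts both samples of class 0 on the left and both samples of class 1 on the right, so the children have zero entropy and the gain equals the parent entropy. A threshold of 2.0 leaves a mixed right child, so it is a worse split.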

Creating the Random Forest

To create the Random Forest, we can create a class that contains a list of decision trees. We can also use bootstrapping to draw a separate sample of the training data, with replacement, for each tree.

class RandomForest:
    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.trees = []
        
    def fit(self, X, y):
        num_samples = X.shape[0]
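
The listing above stops at the start of fit. As one possible way to fill in the remaining steps, here is a compact, self-contained sketch that combines bootstrapping, random feature selection, recursive tree building, and majority voting. It is an illustration under stated assumptions and not the original implementation: the names RandomForestSketch, build_tree, best_split, majority, and predict_one, the seed parameter, the sqrt rule for the number of candidate features, and the toy data are all choices made for this example.

```python
import numpy as np

class DecisionNode:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value  # set only on leaves (assumption of this sketch)

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -np.sum(p * np.log2(p))

def majority(y):
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def best_split(X, y, feature_indices):
    """Best (feature, threshold) by information gain, or (None, None) if nothing helps."""
    best, best_gain, parent = (None, None), 0.0, entropy(y)
    for f in feature_indices:
        for t in np.unique(X[:, f]):
            left = X[:, f] < t
            n_left = left.sum()
            if n_left == 0 or n_left == len(y):
                continue
            child = (n_left * entropy(y[left]) + (len(y) - n_left) * entropy(y[~left])) / len(y)
            if parent - child > best_gain:
                best, best_gain = (f, t), parent - child
    return best

def build_tree(X, y, depth, max_depth, min_samples_split, rng):
    too_deep = max_depth is not None and depth >= max_depth
    if len(np.unique(y)) == 1 or len(y) < min_samples_split or too_deep:
        return DecisionNode(value=majority(y))
    # Random feature selection: consider about sqrt(num_features) features here.
    k = max(1, int(np.sqrt(X.shape[1])))
    f, t = best_split(X, y, rng.choice(X.shape[1], size=k, replace=False))
    if f is None:
        return DecisionNode(value=majority(y))
    left = X[:, f] < t
    return DecisionNode(
        feature_index=f, threshold=t,
        left=build_tree(X[left], y[left], depth + 1, max_depth, min_samples_split, rng),
        right=build_tree(X[~left], y[~left], depth + 1, max_depth, min_samples_split, rng),
    )

def predict_one(node, x):
    while node.value is None:
        node = node.left if x[node.feature_index] < node.threshold else node.right
    return node.value

class RandomForestSketch:
    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2, seed=0):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.rng = np.random.default_rng(seed)
        self.trees = []

    def fit(self, X, y):
        num_samples = X.shape[0]
        self.trees = []
        for _ in range(self.n_estimators):
            # Bagging: each tree is trained on its own bootstrap sample.
            idx = self.rng.integers(0, num_samples, size=num_samples)
            self.trees.append(
                build_tree(X[idx], y[idx], 0, self.max_depth, self.min_samples_split, self.rng)
            )
        return self

    def predict(self, X):
        # Majority vote over the trees for every row of X.
        return np.array([majority([predict_one(tree, x) for tree in self.trees]) for x in X])

# Toy data: two well-separated groups of points.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
              [1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

forest = RandomForestSketch(n_estimators=15, max_depth=3, seed=0).fit(X, y)
predictions = forest.predict(np.array([[0.0, 0.0], [2.0, 2.0]]))
print(predictions)
```

The sketch favors readability over speed. It re-evaluates every threshold at every node, so it is only suitable for small datasets.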

5. Summary

Random Forest Classification is a powerful machine learning technique that combines multiple decision trees to improve the accuracy and stability of the model.

Advantages of using Random Forest Classification model:

It provides high accuracy and stability by reducing the risk of overfitting.
It can remain accurate when part of the data is missing, provided the missing values are handled (for example, by imputation).
It can handle high-dimensional data, since each split considers only a random subset of the features.
It provides feature importance scores that can be used for feature selection.

Cases where Random Forest works best:

Classification and regression problems with large and complex datasets.
Tasks that require high accuracy and stability, such as medical diagnosis, fraud detection, and credit risk assessment.
Tasks that require feature selection or that involve missing values and noisy data.

To summarize this article at OpenGenus, Random Forest is a highly versatile machine learning technique that can handle large and complex datasets and produce accurate and stable predictions. It provides a powerful tool for feature selection and is especially useful in tasks that require high accuracy and stability.
