In this article at OpenGenus, we have explained the concept of Random Forest Classification in depth along with model implementation in Python.
Table of contents:
- Model workings
- Model code using Python libraries
- Model code from scratch without using ML-based Python libraries
Random Forest Classification is a powerful machine learning algorithm used for classification tasks. It is an ensemble learning method that builds multiple decision trees and combines their predictions to produce a final output. Each decision tree is trained on a random subset of the data, and a random subset of features is selected for each split. This randomness reduces the risk of overfitting and makes the model more robust.
Components of a Random Forest Classifier
A Random Forest Classifier consists of the following components:
- Decision trees: The basic building blocks of a Random Forest. Each decision tree is trained on a random subset of the data and a random subset of features for each split. The output of each decision tree is combined to make the final prediction.
- Bagging: Random sampling of the training data with replacement. Bagging creates multiple subsets of the data, each of which is used to train a separate decision tree.
- Random feature selection: Randomly selecting a subset of features at each split in a decision tree. This reduces the correlation between decision trees and improves the model's generalization performance.
- Voting: Combining the predictions of multiple decision trees to make a final prediction. Different voting methods can be used, such as majority voting or weighted voting.
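To make the voting step concrete, here is a minimal sketch of majority and weighted voting using NumPy. The prediction matrix, the weights, and the helper names `majority_vote` and `weighted_vote` are hypothetical and chosen only for illustration; they are not part of any library.

```python
import numpy as np

# Hypothetical predictions from 5 trees for 4 samples (rows = trees, columns = samples).
tree_predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])

def majority_vote(predictions):
    """Return the most frequent class label for each sample (column)."""
    n_classes = predictions.max() + 1
    # counts[c, i] = number of trees that voted for class c on sample i.
    counts = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)

def weighted_vote(predictions, weights):
    """Return the class with the largest total tree weight for each sample."""
    n_classes = predictions.max() + 1
    n_samples = predictions.shape[1]
    totals = np.zeros((n_classes, n_samples))
    for tree_pred, w in zip(predictions, weights):
        # Add this tree's weight to the class it predicted for every sample.
        totals[tree_pred, np.arange(n_samples)] += w
    return totals.argmax(axis=0)

majority = majority_vote(tree_predictions)
# Assumed weights: the fourth tree is trusted five times as much as the others.
weighted = weighted_vote(tree_predictions, weights=np.array([1.0, 1.0, 1.0, 5.0, 1.0]))
print("majority vote:", majority)
print("weighted vote:", weighted)
```

With equal weights both methods agree; a heavily weighted tree can overturn the majority on samples where it disagrees with the rest.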
In the next section, we will explain how Random Forest works and its advantages over a single decision tree.
2. Model workings
How does Random Forest work?
Random Forest works by building a large number of decision trees and then combining their predictions to produce a final output. Each decision tree is trained on a random sample of the training data drawn with replacement, called a bootstrap sample. Training each tree on its own bootstrap sample and aggregating the results is called bagging. Because the errors of individual trees partly cancel out when they are combined, bagging reduces the variance of the model and the risk of overfitting.
In addition to bagging, Random Forest also uses random feature selection. At each split in a decision tree, only a random subset of features is considered. This helps to reduce the correlation between decision trees and improve the model's generalization performance.
Once all the decision trees are built, their predictions are combined using a voting method. For classification tasks, majority voting is often used, where the most frequent class prediction is chosen. For regression tasks, the average prediction is taken.
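The two sources of randomness described above can be sketched with NumPy as follows. This is an illustration under stated assumptions: the data is random toy data, and using the square root of the number of features as the subset size is a common convention rather than something prescribed by this article.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_samples, n_features = 10, 9
X = rng.normal(size=(n_samples, n_features))  # toy data, for illustration only

# Bagging: draw n_samples row indices with replacement (a bootstrap sample).
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
X_boot = X[bootstrap_idx]

# Rows that were never drawn are "out-of-bag" for this particular tree.
oob_idx = np.setdiff1d(np.arange(n_samples), bootstrap_idx)

# Random feature selection: consider only a subset of features at a split.
# Assumption: subset size = sqrt(n_features), a common default for classification.
max_features = int(np.sqrt(n_features))
feature_idx = rng.choice(n_features, size=max_features, replace=False)

print("bootstrap rows:", bootstrap_idx)
print("out-of-bag rows:", oob_idx)
print("features considered at this split:", feature_idx)
```

Because sampling is done with replacement, some rows appear several times in the bootstrap sample and others not at all, so each tree sees a slightly different dataset.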
Advantages of Random Forest over a single decision tree
Random Forest has several advantages over a single decision tree. Some of these advantages are:
- Reduced overfitting: By using bagging and random feature selection, Random Forest reduces the risk of overfitting and improves the model's generalization performance.
- Better accuracy: Random Forest often produces better accuracy than a single decision tree by combining the predictions of multiple decision trees.
- Robustness: Random Forest is less sensitive to noise and outliers in the data than a single decision tree, making it more robust.
- Feature importance: Random Forest can provide a measure of feature importance, which can be useful for feature selection and understanding the data.
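As a rough way to check these claims, the sketch below compares a single decision tree with a Random Forest using 5-fold cross-validation and then prints the most important features. It assumes scikit-learn is installed and uses its built-in breast cancer dataset; the choice of `cv=5`, `random_state=0`, and the top five features is arbitrary, and the scores depend on the dataset, so no particular outcome is guaranteed.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Mean accuracy over 5 folds for each model.
tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)
print("single tree mean accuracy:", tree_scores.mean())
print("random forest mean accuracy:", forest_scores.mean())

# Impurity-based feature importances; they are normalized to sum to 1.
forest.fit(X, y)
importances = forest.feature_importances_
top = np.argsort(importances)[::-1][:5]
for i in top:
    print(data.feature_names[i], round(importances[i], 3))
```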
In the next section, we will see how to implement Random Forest Classification using Python libraries.
3. Model code using Python libraries
Importing libraries and loading data
Before we start coding the Random Forest Classification model, let's import the necessary libraries and load the data. For this example, we will use the breast cancer dataset from the scikit-learn library.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the data
data = load_breast_cancer()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)
Training the model
Now that we have loaded the data, let's train the Random Forest Classification model using the RandomForestClassifier class from the scikit-learn library.
# Train the model
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)
Here, we have set the n_estimators parameter to 100, which means that we will build 100 decision trees. We have also set the max_depth parameter to 5, which limits the maximum depth of each decision tree to 5 levels.
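The effect of these two parameters can be checked on the fitted model. The sketch below repeats the loading and training steps so that it runs on its own, then inspects the `estimators_` attribute and each tree's `get_depth()` method.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

# estimators_ holds the fitted trees, one per n_estimators.
n_trees = len(rf_model.estimators_)
# get_depth() reports the depth each tree actually reached (at most max_depth).
depths = [tree.get_depth() for tree in rf_model.estimators_]
print("number of trees:", n_trees)
print("deepest tree:", max(depths))
```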
Evaluating the model
After training the model, let's evaluate its performance on the testing set using the accuracy_score function from the scikit-learn library.
# Make predictions on the testing set
y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
In this example, we have achieved an accuracy of 96.49% on the testing set.
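Accuracy alone does not show which kind of mistakes the model makes. As an optional extra step that is not part of the walkthrough above, the following sketch prints a confusion matrix and a per-class report using scikit-learn's `confusion_matrix` and `classification_report`. It repeats the earlier loading and training code so that it is self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall and F1 score for each class.
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

For a medical dataset such as this one, the off-diagonal entries of the confusion matrix are often more informative than the overall accuracy.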
Random Forest Classification is a powerful algorithm that can handle complex data and produce accurate predictions. However, it is not always necessary to use pre-built libraries to implement this algorithm. In the next section, we will see how to implement Random Forest Classification from scratch without using any machine learning libraries.
4. Model code from scratch without using ML-based Python libraries
In this section, we will implement the Random Forest Classification algorithm from scratch without using any pre-built machine learning libraries. This will give us a better understanding of the algorithm and its inner workings.
Creating the Decision Tree
To create a decision tree, we first need to create a class for the nodes of the tree. Each node will have the following attributes:
- feature_index: the index of the feature used for splitting at this node.
- threshold: the threshold value for the feature used for splitting at this node.
- left: the left child of this node.
- right: the right child of this node.
- value: the predicted value at this node.
class DecisionNode:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value
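To show how these attributes are used at prediction time, here is a small sketch that walks a hand-built tree. The helper `predict_one` is a hypothetical name introduced for this example, and the sketch assumes that leaf nodes are the ones with `value` set and that samples below the threshold go to the left child.

```python
class DecisionNode:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

def predict_one(node, x):
    """Walk from the root to a leaf and return the leaf's predicted value."""
    while node.value is None:  # internal nodes carry no value
        if x[node.feature_index] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value

# A hand-built tree with a single split on feature 0 at threshold 2.5.
root = DecisionNode(
    feature_index=0,
    threshold=2.5,
    left=DecisionNode(value=0),
    right=DecisionNode(value=1),
)

print(predict_one(root, [1.0, 7.0]))  # feature 0 is below 2.5, so the left leaf is used
print(predict_one(root, [3.0, 7.0]))  # feature 0 is at least 2.5, so the right leaf is used
```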
Next, we can create a function to find the best split at each node. For this, we can calculate the information gain of every candidate feature and threshold, and select the pair with the highest information gain.
import numpy as np

def _best_split(X, y):
    # X has shape (num_samples, num_features)
    num_features = X.shape[1]
    best_feature_index, best_threshold = None, None
    best_gain = -1
    for feature_index in range(num_features):
        feature_values = X[:, feature_index]
        # Every distinct value of the feature is a candidate threshold
        thresholds = np.unique(feature_values)
        for threshold in thresholds:
            gain = _information_gain(y, X, feature_index, threshold)
            if gain > best_gain:
                best_feature_index = feature_index
                best_threshold = threshold
                best_gain = gain
    return best_feature_index, best_threshold
Here, we have used the _information_gain function to calculate the information gain of each candidate split. This function can be defined as follows:
def _information_gain(y, X, feature_index, threshold):
    parent_entropy = _entropy(y)
    # Samples below the threshold go left, the rest go right
    left_indices = X[:, feature_index] < threshold
    right_indices = X[:, feature_index] >= threshold
    num_left, num_right = np.sum(left_indices), np.sum(right_indices)
    # A split that leaves one side empty gives no gain
    if num_left == 0 or num_right == 0:
        return 0
    left_entropy = _entropy(y[left_indices])
    right_entropy = _entropy(y[right_indices])
    # Weighted average of the child entropies
    child_entropy = (num_left/len(y))*left_entropy + (num_right/len(y))*right_entropy
    return parent_entropy - child_entropy
In this function, we have calculated the entropy of the parent node and the child nodes using the _entropy function. This function can be defined as follows:
def _entropy(y):
    # Class frequencies in this node
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / len(y)
    entropy = np.sum(probabilities * -np.log2(probabilities))
    return entropy
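A small worked example may help to check the entropy and information gain calculations by hand. The sketch below redefines the two functions without the leading underscore so that it runs on its own, and uses a toy dataset of four samples with one feature that was made up for this illustration.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of the class labels in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return float(np.sum(p * -np.log2(p)))

def information_gain(y, X, feature_index, threshold):
    """Parent entropy minus the weighted entropy of the two children."""
    left = X[:, feature_index] < threshold
    right = ~left
    if left.sum() == 0 or right.sum() == 0:
        return 0.0
    child = (left.sum() / len(y)) * entropy(y[left]) \
        + (right.sum() / len(y)) * entropy(y[right])
    return float(entropy(y) - child)

# Toy data: one feature, two classes, two samples each.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

parent = entropy(y)                          # two equally likely classes
gain_perfect = information_gain(y, X, 0, 3.0)  # separates the classes exactly
gain_partial = information_gain(y, X, 0, 2.0)  # leaves a mixed right child
print("parent entropy:", parent)
print("gain at threshold 3.0:", gain_perfect)
print("gain at threshold 2.0:", gain_partial)
```

The parent node has an entropy of 1 bit. The split at 3.0 produces two pure children, so it recovers the full 1 bit, while the split at 2.0 leaves one child mixed and gains only about 0.311 bits. The best split search would therefore pick the threshold 3.0.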
Creating the Random Forest
To create the Random Forest, we can create a class that contains a list of decision trees. We can also use bootstrapping to create multiple subsets of the training data for each tree.
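class RandomForest:
    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.trees = []  # fitted decision trees are stored here

    def fit(self, X, y):
        num_samples = X.shape[0]
        # Remaining steps: draw a bootstrap sample of num_samples rows for each
        # of the n_estimators trees, train a tree on it, and append it to self.trees.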
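The listing above stops partway through the fit method. The following is one possible way to put the pieces together, offered as a minimal sketch rather than the definitive implementation. The helpers `build_tree`, `majority_class` and `predict_one`, the `seed` parameter, the square-root rule for the number of features per split, and the toy data are all assumptions introduced for this example.

```python
import numpy as np

class DecisionNode:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        self.feature_index, self.threshold = feature_index, threshold
        self.left, self.right, self.value = left, right, value

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return np.sum(p * -np.log2(p))

def majority_class(y):
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def best_split(X, y, feature_indices):
    """Search only the given features for the split with the highest information gain."""
    best_gain, best_feature, best_threshold = 0.0, None, None
    parent = entropy(y)
    for f in feature_indices:
        for t in np.unique(X[:, f]):
            left = X[:, f] < t
            n_left = left.sum()
            if n_left == 0 or n_left == len(y):
                continue  # one side would be empty
            child = (n_left * entropy(y[left]) + (len(y) - n_left) * entropy(y[~left])) / len(y)
            if parent - child > best_gain:
                best_gain, best_feature, best_threshold = parent - child, f, t
    return best_feature, best_threshold

def build_tree(X, y, rng, max_features, max_depth, min_samples_split, depth=0):
    # Stop when the node is pure, too small, or the depth limit is reached.
    if (len(np.unique(y)) == 1 or len(y) < min_samples_split
            or (max_depth is not None and depth >= max_depth)):
        return DecisionNode(value=majority_class(y))
    # Random feature selection: a fresh subset of features at every split.
    features = rng.choice(X.shape[1], size=max_features, replace=False)
    f, t = best_split(X, y, features)
    if f is None:  # no split improves purity
        return DecisionNode(value=majority_class(y))
    left = X[:, f] < t
    args = (rng, max_features, max_depth, min_samples_split, depth + 1)
    return DecisionNode(f, t,
                        build_tree(X[left], y[left], *args),
                        build_tree(X[~left], y[~left], *args))

def predict_one(node, x):
    while node.value is None:
        node = node.left if x[node.feature_index] < node.threshold else node.right
    return node.value

class RandomForest:
    def __init__(self, n_estimators=100, max_depth=None, min_samples_split=2, seed=0):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.rng = np.random.default_rng(seed)  # assumed parameter, for repeatability
        self.trees = []

    def fit(self, X, y):
        num_samples, num_features = X.shape
        max_features = max(1, int(np.sqrt(num_features)))  # assumed rule of thumb
        self.trees = []
        for _ in range(self.n_estimators):
            # Bagging: each tree is trained on its own bootstrap sample.
            idx = self.rng.integers(0, num_samples, size=num_samples)
            self.trees.append(build_tree(X[idx], y[idx], self.rng, max_features,
                                         self.max_depth, self.min_samples_split))
        return self

    def predict(self, X):
        # votes[t, i] = prediction of tree t for sample i; take the majority per sample.
        votes = np.array([[predict_one(tree, x) for x in X] for tree in self.trees])
        return np.array([majority_class(votes[:, i]) for i in range(X.shape[0])])

# Toy data: two well-separated clusters with 4 features each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(30, 4)), rng.normal(3.0, 0.5, size=(30, 4))])
y = np.array([0] * 30 + [1] * 30)

forest = RandomForest(n_estimators=10, max_depth=3).fit(X, y)
accuracy = np.mean(forest.predict(X) == y)
print("training accuracy:", accuracy)
```

This sketch favors readability over speed: it tries every distinct feature value as a threshold and predicts one sample at a time, so it is only suitable for small datasets.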
Random forest classification is a powerful machine learning technique that combines multiple decision trees to improve the accuracy and stability of the model.
Advantages of using Random Forest Classification model:
- It provides high accuracy and stability by reducing the risk of overfitting.
- It can maintain accuracy when part of the data is missing, although some implementations require missing values to be imputed first.
- It can handle high-dimensional data without requiring explicit dimensionality reduction beforehand.
- It provides feature importance scores that can be used for feature selection.
Cases where Random Forest works best:
- Classification and regression problems with large and complex datasets.
- Tasks that require high accuracy and stability such as medical diagnosis, fraud detection, and credit risk assessment.
- Tasks that require feature selection or that involve missing values and noisy data.
To summarize this article at OpenGenus, Random Forest is a highly versatile machine learning technique that can handle large and complex datasets and produce accurate and stable predictions. It provides a powerful tool for feature selection and is especially useful in tasks that require high accuracy and stability.