Random Forests using Scikit-learn


Reading time: 30 minutes | Coding time: 10 minutes

In this article, we will implement random forest in Python using Scikit-learn (sklearn). Random forest is an ensemble learning algorithm which means it uses many algorithms together or the same algorithm multiple times to get a more accurate prediction.

Random forest intuition

  1. First of all we will pick randomm data points from the training set.
  2. Build a decision tree associated to the selected m data points.
  3. Choose the number of decision trees you want to build and repeat steps 1 and 2.
  4. For a new data point say k, make each one of the decision trees predict the value of y for k and assign k, the average of all the predicted y values.

In random forest regression, there are many decision trees making prediction for the dependent variable.

Now let's build a Random forest regression model.

#importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Importing the libraries numpy for linear algebra matrices, pandas for dataframe manipulation and matplotlib for plotting and we have written %matplotlib inline to view the plots in the jupyter notebook itself.

#importing the dataset
dataset=pd.read_csv('Position_Salaries.csv')

We are using the same dataset, in which we want to predict the salary for a new employee whose level of experience is 6.5 and he said that the previous company paid him 160000 and he wants a higher salary and we have got some data which has three columns- Position, Level and Salary. then here we will use random forest regression to predict his salary based on the data we have.

You can get more information on dataset by typing

dataset.info

This is the dataset.

    Position            Level  Salary
0   Business Analyst      1    45000
1  Junior Consultant      2    50000
2  Senior Consultant      3    60000
3            Manager      4    80000
4    Country Manager      5   110000
5     Region Manager      6   150000
6            Partner      7   200000
7     Senior Partner      8   300000
8            C-level      9   500000
9                CEO     10   1000000

Now we divide our dataset into X and y, where X is the independent variable and y is the dependent variable.

X=dataset.iloc[:,1:2].values
y=dataset.iloc[:,2].values
#fitting the random forest regression to the dataset
from sklearn.ensemble import RandomForestRegressor
regressor=RandomForestRegressor(n_estimators=300,random_state=0)
regressor.fit(X,y)

We are training the entire dataset here and we will test it on any random value. Suppose the new employee said he has a experience of 6.5 years so we will predict his salary based on that.

#predicting the results
from numpy import array
y_pred=regressor.predict(array([[6.5]]))

Now let's check what is the predicted salary for the new employee.

y_pred

It returns 160333.33333333,Which is quite accurate and almost equal to the real value.

Now let's visualize the results

#visualising the Regression results
X_grid=np.arange(min(X),max(X),0.01)
X_grid=X_grid.reshape(len(X_grid),1)
plt.scatter(X,y,color='red')
plt.plot(X_grid,regressor.predict(X_grid),color='blue')
plt.title('Truth vs Bluff(Random Forest Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

Random-forest

So, our prediction with random forest is quite accurate than decision trees. Random forest predicts quite better than decision trees in this case.