OpenSource Internship opportunity by OpenGenus for programmers. Apply now.
Reading time: 30 minutes  Coding time: 10 minutes
In this post, we will see how to deal with an imbalanced dataset using SMOTE (Synthetic Minority Oversampling TEchnique). We will also see its implementation in Python.
Imbalanced Dataset
An individual in the domain of Machine Learning is likely to come across a dataset where the class labels distribution is significantly different. In simple words, Imbalanced Dataset usually reflects an unequal distribution of classes within a dataset.
The term accuracy can be highly misleading as a performance metric for such data.
Consider a dataset with 1000 data points having 950 points of class 1 and 50 points of class 0. If we have a model which predicts all observations as 1, the accuracy in such case would be 950/1000= 95%.
However, is it really a good model? A BIG NO :)
In such cases, we should analyze the Recall, Precision and F1scores depending on the business requirements. Consider an example of cancer detection; here, False Negatives are of primary concern. The value of false negatives should be as low as possible since we do not want our model to predict a cancerous patient as noncancerous.
A similar scenario happens in the case of detecting whether a given transaction is fraud or not. These metrics can be calculated using the following forumulas.
Dealing with Imbalanced Dataset
There exists a bunch of sampling techniques to deal with imbalanced data, which are primarily classified into
Undersampling
 In this sampling technique, the samples of the majority class are randomly removed to match the proportion of distribution when compared to the minority class.
 It is generally not preferred since we are losing valuable information just to match the proportion of 2 classes and may lead to bias.
Oversampling
 In this technique, we increase the samples of minority class to make samples of both minor and major classes as equal.
 One possible way is to replicate the samples of the minority class, and the other possible method is to generate some synthetic points using SMOTE.
 The random oversampling(replicating minority samples) is not preferred since it can lead to overfitting due to copying the same information.
What is SMOTE?
 SMOTE stands for Synthetic Minority Oversampling TEchnique.
 It is an oversampling technique in which new synthetic observations are created using the existing samples of the minority class.
 It generates virtual training records by linear interpolation for the minority class.
 These synthetic training records are generated by randomly selecting one or more of the knearest neighbours for each example in the minority class.
 After the oversampling process, the data is reconstructed, and several classification models can be applied for the processed data.
The various steps involved in SMOTE are
Step 1: Setting the minority class set A, for each $x \in A$, the knearest neighbors of x are obtained by calculating the Euclidean distance between x and every other sample in set A.
Step 2: The sampling rate N is set according to the imbalanced proportion. For each $x \in A$, N examples (i.e x1, x2, â€¦xn) are randomly selected from its knearest neighbors, and they construct the set $A_1$ .
Step 3: For each example $x_k \in A_1$ (k=1, 2, 3â€¦N), the following formula is used to generate a new example:
$x' = x + rand(0, 1) * \mid x  x_k \mid$
where rand(0, 1) denotes a random number between 0 and 1.

First, the initial distribution of the minority class is shown.

Say if the value of k nearest neighbour is 2. Each point will find its 2 nearest neighbours (say using euclidean distance). If we consider only point A initially, then B and C are its nearest neighbours.

By using step 3 in the above algorithm, new synthetic points are generated. It is not necessary to generate a single synthetic point on each line; it rather depends on the number of synthetic samples required. A single line can accommodate multiple synthetic points as well.
 Similarly, all points are considered, and synthetic observations are generated in a similar fashion for them also.
Implementation in Python
It is very easy to incorporate SMOTE using Python. We only have to install the imbalancedlearn package.
pip install imblearn
The dataset used is of Credit Card Fraud Detection from Kaggle and can be downloaded from here.
 Importing necessary packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
import seaborn as sns
 Reading the dataset
df = pd.read_csv('creditcard.csv')
 Analyzing class distribution
class_dist=df['Class'].value_counts()
print(class_dist)
print('\nClass 0: {:0.2f}%'.format(100 *loan_status_dist[0] / (class_dist[0]+class_dist[1])))
print('Class 1: {:0.2f}%'.format(100 *loan_status_dist[1] / (class_dist[0]+class_dist[1])))
Class 0: 99.83%
Class 1: 0.17%
 Splitting data into train and test test
X = df.drop(columns=['Time','Class'])
y = df['Class']
x_train,x_test,y_train,y_test = train_test_split(X,y,random_state=100,test_size=0.3,stratify=y)
 Evaluating results without SMOTE
model = LogisticRegression()
model.fit(x_train,y_train)
pred = model.predict(x_test)
print('Accuracy ',accuracy_score(y_test,pred))
print(classification_report(y_test,pred))
sns.heatmap(confusion_matrix(y_test,pred),annot=True,fmt='.2g')
If the recall measure is of main concern, we see that its value is only 0.59 and it certainly needs to be improved.
 Here we use the SMOTE module from imblearn
k_neighbours represents number of nearest to be consider while generating synthetic points.
sampling_strategy by default generates synthetic points equal to number of points in majority class. Since, here it is 0.5 it will generate synthetic points half of that of majority class points.
random_state is simply for reproducability purpose.
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy=0.5,k_neighbors=5,random_state = 100)
X_train_res, y_train_res = sm.fit_sample(x_train, y_train.ravel())
print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))
 Results of SMOTE
lr = LogisticRegression()
lr.fit(X_train_res, y_train_res.ravel())
predictions = lr.predict(x_test)
print('Accuracy ',accuracy_score(y_test,predictions))
print(classification_report(y_test, predictions))
sns.heatmap(confusion_matrix(y_test,predictions),annot=True,fmt='.2g')
We can clearly see, the Recall value has significantly improved by using the SMOTE technique.
Conclusion
In this post we looked at why accuracy is not a good metric for an imbalanced data sceanario. After that we saw different sampling techniques (Under & Over sampling). We then looked at a deatiled explanation of SMOTE and its implementation in Python.