In this article at OpenGenus, we will learn to implement a one-hot encoded array in Python using NumPy, scikit-learn, Pandas, Keras, TensorFlow, and built-in Python methods.
Table of contents
 Introduction
 Implementation in Numpy
 Implementation in other libraries
 Use cases in Machine Learning
 Importance in Machine Learning
Alright, let's get started.
Introduction
One-hot encoding is a technique used to represent categorical data in a way that can be used as input to machine learning algorithms. The idea is to convert each category into a binary vector with one element per possible category value: the element for the category that is present is 1, and all other elements are 0.
There are different ways to implement one-hot encoding in Python, using built-in language features and Python libraries.
Implementation in Numpy
NumPy is a fundamental library in Python for scientific computing and provides many functions for data manipulation. There are many ways to perform one-hot encoding in NumPy. We will discuss the ones that are most commonly used.
1.1 Iterate over each category in the categories array using enumerate(categories). This provides both the index i and the category value. For each category, compare it with every element of the data array using the expression (data == category). This creates a boolean array of the same length as data where each element is True if it matches the current category and False otherwise. Convert the boolean array to integers using .astype(int), which turns True into 1 and False into 0, and store the result in column i of encoded_data.
import numpy as np
# create example data
categories = ['red', 'blue', 'green', 'yellow']
data = np.array(['red', 'blue', 'green', 'yellow'])
# perform one-hot encoding
encoded_data = np.zeros((len(data), len(categories)))
for i, category in enumerate(categories):
    encoded_data[:, i] = (data == category).astype(int)
print(encoded_data)
Output
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
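The column-by-column comparison above can also be written as a single broadcast comparison. The following is a minimal sketch, assuming categories is converted to a NumPy array; the name encoded_broadcast and the sample data are made up for illustration.

```python
import numpy as np

categories = np.array(['red', 'blue', 'green', 'yellow'])
data = np.array(['green', 'red', 'yellow', 'green'])

# data[:, None] has shape (4, 1) and categories has shape (4,), so the
# comparison broadcasts to a (len(data), len(categories)) boolean array.
encoded_broadcast = (data[:, None] == categories).astype(int)
print(encoded_broadcast)
```

Each row of the result corresponds to one element of data, and each column to one category, exactly as in the loop version.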
1.2 Use NumPy indexing to assign 1s to the appropriate locations in b. First create b as an array of zeros with one row per element of a and a.max() + 1 columns. np.arange(a.size) creates an array of indices corresponding to the rows of b, and a is used as the column indices. The expression b[np.arange(a.size), a] selects one element in each row of b, at the column given by a, and the assignment sets those elements to 1.
import numpy as np
a = np.array([1, 0, 3])
b = np.zeros((a.size, a.max() + 1))
b[np.arange(a.size), a] = 1
print(b)
Output
[[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]]
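Inferring the width from a.max() + 1 gives too few columns when the largest class does not occur in a. One possible way around this is to let the caller pass the number of classes. This is only a sketch: the function name one_hot_indexed and its num_classes argument are hypothetical, not part of NumPy.

```python
import numpy as np

def one_hot_indexed(a, num_classes=None):
    """One-hot encode integer labels (hypothetical helper for illustration)."""
    a = np.asarray(a)
    if num_classes is None:
        # Fall back to the width used in the example above.
        num_classes = a.max() + 1
    b = np.zeros((a.size, num_classes))
    b[np.arange(a.size), a] = 1
    return b

labels = np.array([1, 0, 3])
print(one_hot_indexed(labels).shape)                 # width inferred from the data
print(one_hot_indexed(labels, num_classes=6).shape)  # width fixed by the caller
```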
1.3 Use np.eye(n_values) to create an identity matrix with n_values rows and columns. This matrix serves as the basis for the one-hot encoding: row k is the one-hot vector for class k. Index the identity matrix using values to obtain the corresponding one-hot encoded array.
import numpy as np
values = [1, 0, 3]
n_values = np.max(values) + 1
one_hot_encoded = np.eye(n_values)[values]
print(one_hot_encoded)
Output
[[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]]
1.4 Use np.eye(num_classes) to create an identity matrix with num_classes rows and columns, and flatten the input array a to a 1D array using a.reshape(-1) to ensure compatibility with indexing. Because of the flattening, the result has one row per element of a regardless of the shape of a. Finally, np.squeeze() removes any singleton dimensions, for example returning a single 1D vector when a holds only one value.
import numpy as np
def one_hot(a, num_classes):
    # reshape(-1) flattens a; indexing then picks one identity row per label
    return np.squeeze(np.eye(num_classes)[a.reshape(-1)])
# Example values
a = np.array([2, 0, 1, 2])
num_classes = 4
one_hot_encoded = one_hot(a, num_classes)
print(one_hot_encoded)
Output
[[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]]
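To see what the flattening and np.squeeze() steps do, it helps to look at output shapes. The sketch below assumes the reshape argument is -1 (flatten to 1D); the inputs grid and single are made up for illustration.

```python
import numpy as np

def one_hot(a, num_classes):
    # Flatten the labels, pick one identity row per label, drop size-1 axes.
    return np.squeeze(np.eye(num_classes)[a.reshape(-1)])

grid = np.array([[2, 0], [1, 3]])   # 2D input holding four labels
single = np.array([2])              # input holding a single label

print(one_hot(grid, 4).shape)    # one row per label
print(one_hot(single, 4).shape)  # the size-1 row axis is squeezed away
```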
1.5 Use np.eye(x.max() + 1) to create an identity matrix with the number of rows and columns equal to the maximum value in x plus 1. Index the identity matrix using the class vector x to obtain the corresponding one-hot encoded array.
import numpy as np
# Example class vector
x = np.array([2, 1, 3, 2, 0, 1])
one_hot_encoded = np.eye(x.max() + 1)[x]
print(one_hot_encoded)
Output
[[0. 0. 1. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 1. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]]
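A quick way to check an encoding like this is to decode it again. The sketch below assumes every row contains exactly one 1, so np.argmax along each row recovers the original class index; the name decoded is introduced here for illustration.

```python
import numpy as np

x = np.array([2, 1, 3, 2, 0, 1])
one_hot_encoded = np.eye(x.max() + 1)[x]

# Each row has a single 1, so the position of the maximum is the class index.
decoded = np.argmax(one_hot_encoded, axis=1)
print(decoded)
print(np.array_equal(decoded, x))
```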
1.6 Define the one_hot() function that takes an array x and the depth (number of classes) as input. Use np.eye(depth) to create an identity matrix with depth rows and columns. This matrix serves as the basis for the one-hot encoding. Use np.take to retrieve the corresponding rows from the identity matrix based on the values in x. With axis=0, whole rows are selected, so the output has the shape of x plus one trailing dimension of size depth, whatever the shape of x.
import numpy as np
def one_hot(x, depth: int):
    return np.take(np.eye(depth), x, axis=0)
# Example data
x = np.array([[1, 2], [0, 3]])
one_hot_encoded = one_hot(x, depth=4)
print(one_hot_encoded)
Output
[[[0. 1. 0. 0.]
  [0. 0. 1. 0.]]

 [[1. 0. 0. 0.]
  [0. 0. 0. 1.]]]
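The shape behaviour of np.take with axis=0 can be checked directly. The following sketch compares it against plain indexing of the identity matrix; the names encoded and same_as_indexing are made up for illustration.

```python
import numpy as np

def one_hot(x, depth: int):
    return np.take(np.eye(depth), x, axis=0)

x = np.array([[1, 2], [0, 3]])
encoded = one_hot(x, depth=4)

# The input shape is kept and a trailing axis of size depth is appended.
print(x.shape, '->', encoded.shape)

# Taking rows along axis 0 matches indexing the identity matrix with x.
same_as_indexing = np.array_equal(encoded, np.eye(4)[x])
print(same_as_indexing)
```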
Implementation in other libraries
1. Using Python's built-in features
Plain Python is enough to perform one-hot encoding without any library, using list comprehensions and dictionary comprehensions. Here's an example:
# create example data
categories = ['red', 'blue', 'green', 'yellow']
data = ['red', 'blue', 'green', 'yellow']
# perform one-hot encoding using list comprehension
encoded_data = [[1 if c == category else 0 for category in categories] for c in data]
print(encoded_data)
Output
[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
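The example above uses a list comprehension only. A dictionary comprehension can play the same role by building a lookup table from each category to its vector once. This is a minimal sketch; the name lookup and the sample data are assumptions made for illustration.

```python
categories = ['red', 'blue', 'green', 'yellow']
data = ['green', 'red', 'green']

# Dictionary comprehension: map every category to its one-hot vector once.
lookup = {
    category: [1 if i == j else 0 for j in range(len(categories))]
    for i, category in enumerate(categories)
}

# Encoding is then a dictionary lookup per element.
encoded_data = [lookup[c] for c in data]
print(encoded_data)
```

With this approach, a value that is not in categories raises a KeyError instead of silently producing an all-zero row.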
Apart from NumPy, there are several Python libraries that support one-hot encoding. Let's explore how to implement one-hot encoding with each of them in turn:
2. Using scikit-learn
scikit-learn is a popular machine learning library in Python that provides many tools for data preprocessing, including one-hot encoding. Here are a few examples using different classes in scikit-learn.
2.1 Using OneHotEncoder
The OneHotEncoder class takes categorical data as input and transforms it into a binary one-hot encoded representation. It provides methods such as fit, transform, and fit_transform to fit the encoder to the data and perform the encoding. By default the result is a SciPy sparse matrix, which toarray() converts to a dense NumPy array.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# create example data
categories = ['red', 'blue', 'green', 'yellow']
data = np.array([['red'], ['blue'], ['green'], ['yellow']])
# create encoder object
encoder = OneHotEncoder(categories=[categories])
# fit and transform data
encoded_data = encoder.fit_transform(data)
# convert sparse matrix to numpy array
encoded_data = encoded_data.toarray()
print(encoded_data)
Output
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
2.2 Using LabelBinarizer
The LabelBinarizer class takes categorical target data and transforms it into a binary representation suitable for classification tasks. It provides methods such as fit, transform, and fit_transform to fit the binarizer to the data and perform the transformation.
import sklearn.preprocessing
a = [1,0,3]
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(range(max(a)+1))
b = label_binarizer.transform(a)
print('{0}'.format(b))
Output
[[0 1 0 0]
[1 0 0 0]
[0 0 0 1]]
3. Using Pandas
Pandas is another popular library in Python for data manipulation and analysis. Pandas provides a convenient get_dummies() function to perform one-hot encoding. Here's an example:
import pandas as pd
# create example data
data = pd.DataFrame({'color': ['red', 'blue', 'green', 'yellow']})
# perform one-hot encoding (dtype=int gives 0/1 columns; pandas 2.0 and later default to booleans)
encoded_data = pd.get_dummies(data, dtype=int)
print(encoded_data)
Output
   color_blue  color_green  color_red  color_yellow
0           0            0          1             0
1           1            0          0             0
2           0            1          0             0
3           0            0          0             1
4. Using Keras
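Keras is a popular high-level neural network API in Python that provides many tools for deep learning. Keras provides a convenient to_categorical() function to perform one-hot encoding. It expects integer class indices, so string labels have to be mapped to integers first. Here's an example:
from keras.utils import to_categorical
# create example data
categories = ['red', 'blue', 'green', 'yellow']
data = ['red', 'blue', 'green', 'yellow']
# map each label to its integer index in categories
indices = [categories.index(c) for c in data]
# perform one-hot encoding
encoded_data = to_categorical(indices, num_classes=len(categories))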
print(encoded_data)
Output
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
5. Using TensorFlow
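TensorFlow is another popular deep learning library in Python that provides many tools for machine learning. TensorFlow provides a convenient one_hot() function to perform one-hot encoding. Because tf.one_hot() takes integer indices, the example first converts each string to its position in categories. Here's an example: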
import tensorflow as tf
# create example data
categories = ['red', 'blue', 'green', 'yellow']
data = tf.constant(['red', 'blue', 'green', 'yellow'])
# convert each string to its integer index in categories
indices = tf.argmax(tf.cast(tf.equal(tf.expand_dims(data, 1), categories), tf.int32), axis=1)
# perform one-hot encoding
encoded_data = tf.one_hot(indices, depth=len(categories))
print(encoded_data.numpy())
Output
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
Use cases in Machine Learning
Let's consider a few examples to better understand the one-hot encoding technique.
1. One-Hot Encoding for Colors
Suppose we have a dataset of shirts that we want to use to train a machine learning model to predict the color of a shirt. The color attribute has four categories: red, blue, green, and yellow. We can represent these categories using one-hot encoding as follows:
import numpy as np
categories = ['red', 'blue', 'green', 'yellow']
# Create an empty array to hold the one-hot encoded vectors
one_hot = np.zeros((len(categories), len(categories)), dtype=np.int32)
# Encode each category as a binary vector
for i, category in enumerate(categories):
    one_hot[i, i] = 1
# Print the resulting one-hot encoded array
print(one_hot)
Color   Red  Blue  Green  Yellow
Red     1    0     0      0
Blue    0    1     0      0
Green   0    0     1      0
Yellow  0    0     0      1
2. One-Hot Encoding for Genres
Suppose we have a dataset of movies that we want to use to train a machine learning model to predict the genre of a movie. The genre attribute has five categories: action, comedy, drama, horror, and romance. We can represent these categories using one-hot encoding as follows:
Genre    Action  Comedy  Drama  Horror  Romance
Action   1       0       0      0       0
Comedy   0       1       0      0       0
Drama    0       0       1      0       0
Horror   0       0       0      1       0
Romance  0       0       0      0       1
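The genre table above could be produced without listing the categories by hand. The sketch below uses np.unique to derive them from the data, which assumes alphabetical column order is acceptable; the sample genres array and the name encoded_genres are made up for illustration.

```python
import numpy as np

genres = np.array(['drama', 'action', 'romance', 'comedy', 'horror', 'drama'])

# np.unique returns the sorted distinct values and, with return_inverse=True,
# the index of each original element within that sorted array.
categories, indices = np.unique(genres, return_inverse=True)
encoded_genres = np.eye(len(categories), dtype=int)[indices]

print(categories)
print(encoded_genres)
```

The columns follow the order of categories, so the first column is action and the last is romance, matching the table.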
3. One-Hot Encoding for Letters
Suppose we have a dataset of letters that we want to use to train a machine learning model to predict the position of a letter in the alphabet. The letter attribute has 26 categories: A, B, C, ..., Z. We can represent these categories using one-hot encoding as follows:
Letter  A    B    C    ...  Y    Z
A       1    0    0    ...  0    0
B       0    1    0    ...  0    0
C       0    0    1    ...  0    0
...     .    .    .    ...  .    .
Y       0    0    0    ...  1    0
Z       0    0    0    ...  0    1
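For the letters case, the column index of an uppercase letter can be computed from its character code. The sketch below assumes input restricted to the uppercase letters A to Z; the sample word and the names positions and encoded_letters are hypothetical.

```python
import string
import numpy as np

letters = string.ascii_uppercase   # 'ABC...Z', the 26 categories
word = "CAB"                       # hypothetical input

# ord(ch) - ord('A') is the zero-based position of an uppercase letter.
positions = np.array([ord(ch) - ord('A') for ch in word])
encoded_letters = np.eye(len(letters), dtype=int)[positions]

print(encoded_letters.shape)
print(encoded_letters[:, :3])   # only the columns for A, B and C
```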
Importance in Machine Learning
One-hot encoding is important in machine learning for several reasons:

Enables the use of categorical variables in machine learning algorithms: Many machine learning algorithms require input variables to be numerical. One-hot encoding allows us to represent categorical variables as numerical variables, which can be used as input to these algorithms.

Avoids ordering bias: When categorical variables are encoded as numbers (e.g., assigning the values 1, 2, 3 to three different categories), some algorithms may assume an inherent ordering or magnitude among the categories, which may not exist. One-hot encoding removes any implicit ordering or magnitude assumptions and treats each category as a distinct and equally important feature.

Limits the cost of the added dimensions through sparsity: One-hot encoding increases the dimensionality of the feature space, because every category becomes its own column. Each row, however, contains a single 1 per encoded variable, so the result can be stored as a sparse matrix that keeps only the nonzero elements. This helps keep the memory and computation cost of the extra dimensions manageable for algorithms that accept sparse input.

Supports nonparametric models: One-hot encoding is useful for models such as decision trees, random forests, and support vector machines. Many implementations of these models expect numerical input, so one-hot encoding lets them use categorical data without implying an order between categories.
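The ordering-bias point above can be made concrete by comparing distances under the two encodings. This is a minimal sketch, assuming the integer encoding red=0, blue=1, green=2, yellow=3; the variable names are made up for illustration.

```python
import numpy as np

categories = ['red', 'blue', 'green', 'yellow']

# Integer (label) encoding: red=0, blue=1, green=2, yellow=3.
label_encoded = np.arange(len(categories))
# One-hot encoding: one row of the identity matrix per category.
one_hot_encoded = np.eye(len(categories))

# Distance from 'red' to each of the other categories under both encodings.
label_dist = np.abs(label_encoded[1:] - label_encoded[0])
one_hot_dist = np.linalg.norm(one_hot_encoded[1:] - one_hot_encoded[0], axis=1)

print(label_dist)
print(one_hot_dist)
```

Under the integer encoding, yellow appears three times as far from red as blue does, while under one-hot encoding every pair of distinct categories is the same distance apart.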
Overall, one-hot encoding is an important technique for preprocessing categorical data in machine learning, and a widely used method for representing categorical variables as numerical variables that can be used as input to machine learning algorithms.
With this article at OpenGenus, you should now have a complete idea of how to implement a one-hot encoded array in Python.