In this article at OpenGenus, we will learn how to implement a one-hot encoded array in Python using NumPy, Scikit-learn, Pandas, Keras, TensorFlow, and built-in Python features.

Table of contents

Introduction
Implementation in Numpy
Implementation in other libraries
Use-cases in Machine Learning
Importance in Machine Learning

Alright, let's get started.

Introduction

One-hot encoding is a technique used to represent categorical data in a way that can be used as input in machine learning algorithms. The idea is to convert each category into a binary vector, where each element of the vector represents a possible category value and is either 1 or 0, depending on whether the category is present or not.

There are different ways to implement one-hot encoding in Python, using either built-in language features or Python libraries.

Implementation in Numpy

NumPy is a fundamental library in Python for scientific computing and provides many functions for data manipulation. There are many ways to perform one-hot encoding in NumPy. We will discuss some of the important ways of implementation which are generally used.

1.1

Iterate over each category in the categories array using enumerate(categories). This provides both the index i and the category value.

For each category, compare it with every element of the data array using the expression (data == category). This creates a boolean array of the same length as data in which each element is True if it matches the current category and False otherwise.

Convert the boolean array to integers using .astype(int). This converts True to 1 and False to 0, and the result is written into column i of encoded_data.

import numpy as np

# create example data
categories = ['red', 'blue', 'green', 'yellow']
data = np.array(['red', 'blue', 'green', 'yellow'])

# perform one-hot encoding
encoded_data = np.zeros((len(data), len(categories)))
for i, category in enumerate(categories):
    encoded_data[:, i] = (data == category).astype(int)

print(encoded_data)

Output

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
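
The per-category loop above can also be written as a single broadcast comparison. The following is a minimal sketch of that idea, using a different example data array so that repeated values are visible; it is one possible variant, not the only way to vectorize the loop.

```python
import numpy as np

categories = np.array(['red', 'blue', 'green', 'yellow'])
data = np.array(['green', 'red', 'yellow', 'red'])

# data[:, None] has shape (4, 1); comparing it with categories (shape (4,))
# broadcasts to a (4, 4) boolean matrix with one row per data element
encoded_data = (data[:, None] == categories).astype(int)

print(encoded_data)
```

Each row contains a single 1 in the column of the matching category, exactly as in the loop version.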

1.2

Use NumPy indexing to assign 1s to the appropriate locations in b. np.arange(a.size) creates an array of indices corresponding to the rows of b, and a is used as the column indices.

The expression b[np.arange(a.size), a] selects specific elements from b based on the row and column indices.

import numpy as np

a = np.array([1, 0, 3])
b = np.zeros((a.size, a.max() + 1))
b[np.arange(a.size), a] = 1

print(b)

Output

[[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]]
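
The indexing approach needs integer class indices. When the labels are strings, one way to obtain such indices is np.unique with return_inverse=True. The sketch below assumes that a sorted column order is acceptable; the labels array is made up for illustration.

```python
import numpy as np

labels = np.array(['red', 'blue', 'green', 'blue'])

# classes holds the sorted unique labels; indices gives, for each label,
# its position in classes
classes, indices = np.unique(labels, return_inverse=True)

b = np.zeros((labels.size, classes.size))
b[np.arange(labels.size), indices] = 1

print(classes)
print(b)
```

Here classes also documents which column belongs to which label.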

1.3

Use np.eye(n_values) to create an identity matrix with n_values number of rows and columns. This matrix serves as the basis for the one-hot encoding.

Index the identity matrix using values to obtain the corresponding one-hot encoded array.

import numpy as np

values = [1, 0, 3]
n_values = np.max(values) + 1

one_hot_encoded = np.eye(n_values)[values]

print(one_hot_encoded)

Output

[[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]]

1.4

Use np.eye(num_classes) to create an identity matrix with num_classes rows and columns, and flatten the input array a to 1D with a.reshape(-1) so that it can be used to index the rows of that matrix. The result is a 2D array with one row per element of a.

Finally, np.squeeze() removes any axes of length 1, so an input with a single element yields a 1D one-hot vector instead of an array of shape (1, num_classes).

import numpy as np

def one_hot(a, num_classes):
    return np.squeeze(np.eye(num_classes)[a.reshape(-1)])

# Example values
a = np.array([2, 0, 1, 2])
num_classes = 4

one_hot_encoded = one_hot(a, num_classes)

print(one_hot_encoded)

Output

[[0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]]
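
To see what reshape(-1) and np.squeeze() contribute, it helps to look at the output shapes for a multi-dimensional input and for a single label. This is a small sketch that reuses the same one_hot() function with made-up inputs.

```python
import numpy as np

def one_hot(a, num_classes):
    return np.squeeze(np.eye(num_classes)[a.reshape(-1)])

# a 2D input is flattened first, so the result has one row per element
batch = one_hot(np.array([[2, 0], [1, 3]]), 4)
print(batch.shape)

# a single label yields a (1, 4) array, which np.squeeze reduces to (4,)
single = one_hot(np.array([2]), 4)
print(single.shape)
```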

1.5

Use np.eye(x.max() + 1) to create an identity matrix with the number of rows and columns equal to the maximum value in x plus 1.

Index the identity matrix using the class vector x to obtain the corresponding one-hot encoded array.

import numpy as np

# Example class vector
x = np.array([2, 1, 3, 2, 0, 1])

one_hot_encoded = np.eye(x.max() + 1)[x]

print(one_hot_encoded)

Output

[[0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]]
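
One thing to keep in mind with x.max() + 1 is that the width of the result depends on the data. If a batch happens not to contain the highest class, the encoded array has fewer columns. The sketch below illustrates this, assuming the total number of classes (num_classes, a name introduced here) is known in advance.

```python
import numpy as np

num_classes = 4  # assumed to be known from the full dataset

# this batch happens not to contain class 3
batch = np.array([2, 1, 0, 1])

inferred = np.eye(batch.max() + 1)[batch]  # width depends on the batch
fixed = np.eye(num_classes)[batch]         # width is always num_classes

print(inferred.shape)
print(fixed.shape)
```

Passing the number of classes explicitly keeps the column count consistent across batches.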

1.6

Define the one_hot() function that takes an array x and the depth (number of classes) as input.

Use np.eye(depth) to create an identity matrix with depth number of rows and columns. This matrix serves as the basis for the one-hot encoding.

Use np.take to retrieve the corresponding rows from the identity matrix based on the values in x. With axis=0, each value in x selects one row, so x can have any shape and the result has the shape of x plus one trailing axis of length depth.

import numpy as np

def one_hot(x, depth: int):
    return np.take(np.eye(depth), x, axis=0)

# Example data
x = np.array([[1, 2], [0, 3]])

one_hot_encoded = one_hot(x, depth=4)

print(one_hot_encoded)

Output

[[[0. 1. 0. 0.]
  [0. 0. 1. 0.]]

 [[1. 0. 0. 0.]
  [0. 0. 0. 1.]]]
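
As a quick check of this version, the encoding can be inverted with argmax over the last axis. The sketch below reuses the one_hot() function from above and only adds the round trip.

```python
import numpy as np

def one_hot(x, depth: int):
    return np.take(np.eye(depth), x, axis=0)

x = np.array([[1, 2], [0, 3]])
encoded = one_hot(x, depth=4)

# the encoding adds one trailing axis of length depth
print(encoded.shape)

# argmax over that last axis recovers the original class indices
decoded = encoded.argmax(axis=-1)
print(decoded)
```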

Implementation in other libraries

1. Using Python's built-in functions
Python has no dedicated built-in function for one-hot encoding, but it can be implemented with plain list comprehensions (or dictionary comprehensions) and no external libraries. Here's an example:

# create example data
categories = ['red', 'blue', 'green', 'yellow']
data = ['red', 'blue', 'green', 'yellow']

# perform one-hot encoding using list comprehension
encoded_data = [[1 if c == category else 0 for category in categories] for c in data]

print(encoded_data)

Output

[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
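
A dictionary comprehension can serve the same purpose by mapping each category to its column index once, which avoids comparing every value against every category. This is a minimal sketch with made-up data; note that a value missing from categories would raise a KeyError here.

```python
categories = ['red', 'blue', 'green', 'yellow']
data = ['green', 'red', 'red', 'yellow']

# map every category to its column index
index = {category: i for i, category in enumerate(categories)}

# build one row per data item with a single 1 at the category's column
encoded_data = []
for c in data:
    row = [0] * len(categories)
    row[index[c]] = 1
    encoded_data.append(row)

print(encoded_data)
```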

Apart from NumPy and plain Python, several Python libraries support one-hot encoding. Let's explore how to implement one-hot encoding with each of them:

2. Using Scikit-learn
Scikit-learn is a popular machine learning library in Python that provides many tools for data preprocessing, including one-hot encoding. Here are a few examples using different classes in Scikit-learn.

2.1 Using OneHotEncoder
The OneHotEncoder class takes categorical data as input and transforms it into a binary one-hot encoded representation. It provides methods such as fit, transform, and fit_transform to fit the encoder to the data and perform the encoding.

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# create example data
categories = ['red', 'blue', 'green', 'yellow']
data = np.array([['red'], ['blue'], ['green'], ['yellow']])

# create encoder object
encoder = OneHotEncoder(categories=[categories])

# fit and transform data
encoded_data = encoder.fit_transform(data)

# convert sparse matrix to numpy array
encoded_data = encoded_data.toarray()

print(encoded_data)

Output

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
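
In practice, data seen at prediction time may contain categories that were not present when the encoder was fitted. The sketch below shows one way to deal with that using the handle_unknown='ignore' option of OneHotEncoder; the train and new_data arrays are made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

train = np.array([['red'], ['blue'], ['green']])
new_data = np.array([['blue'], ['purple']])  # 'purple' was not seen during fit

# handle_unknown='ignore' encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# categories_ shows the column order learned from the data (sorted)
print(encoder.categories_)

encoded = encoder.transform(new_data).toarray()
print(encoded)
```

With the default setting, transforming an unseen category raises an error instead.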

2.2 Using LabelBinarizer
The LabelBinarizer class takes categorical target data and transforms it into a binary representation suitable for classification tasks. It provides methods such as fit, transform, and fit_transform to fit the binarizer to the data and perform the transformation.

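import sklearn.preprocessing

a = [1, 0, 3]

# fit on every class index from 0 to max(a) so that each class gets a column
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(range(max(a) + 1))

# transform the labels into one-hot rows
b = label_binarizer.transform(a)
print(b)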

Output

[[0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]]
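
LabelBinarizer also works directly on string labels and can map the encoded rows back to the original labels. The following sketch uses made-up color labels to show both directions.

```python
from sklearn.preprocessing import LabelBinarizer

labels = ['red', 'blue', 'green', 'yellow', 'blue']

label_binarizer = LabelBinarizer()
encoded = label_binarizer.fit_transform(labels)

# classes_ lists the column order (sorted labels)
print(label_binarizer.classes_)
print(encoded)

# inverse_transform maps the one-hot rows back to the original labels
decoded = label_binarizer.inverse_transform(encoded)
print(decoded)
```

Note that with only two distinct classes, LabelBinarizer returns a single 0/1 column rather than two one-hot columns.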

3. Using Pandas
Pandas is another popular library in Python for data manipulation and analysis. Pandas provides a convenient get_dummies() function to perform one-hot encoding. Here's an example:

import pandas as pd

# create example data
data = pd.DataFrame({'color': ['red', 'blue', 'green', 'yellow']})

# perform one-hot encoding (dtype=int gives 0/1 columns; recent pandas versions default to True/False)
encoded_data = pd.get_dummies(data, dtype=int)

print(encoded_data)

Output

   color_blue  color_green  color_red  color_yellow
0           0            0          1             0
1           1            0          0             0
2           0            1          0             0
3           0            0          0             1
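
The columns produced by get_dummies() follow the values that actually occur in the data, in sorted order. If a fixed set and order of columns is needed, one option is to give the column a categorical dtype first. The sketch below assumes the full list of categories is known in advance.

```python
import pandas as pd

categories = ['red', 'blue', 'green', 'yellow']

# a categorical dtype carries the full category list, including values
# that do not occur in this particular sample ('yellow' here)
colors = pd.Series(['red', 'blue', 'green'],
                   dtype=pd.CategoricalDtype(categories=categories))

encoded = pd.get_dummies(colors, dtype=int)

print(encoded)
```

The result has one column per declared category, in the declared order, and the column for the unobserved category contains only zeros.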

4. Using Keras
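Keras is a popular high-level neural network API in Python that provides many tools for deep learning. Keras provides a convenient to_categorical() function to perform one-hot encoding. It expects integer class indices rather than strings, so the labels are first mapped to their positions in the category list. Here's an example:

from keras.utils import to_categorical

# create example data
categories = ['red', 'blue', 'green', 'yellow']
data = ['red', 'blue', 'green', 'yellow']

# to_categorical expects integer class indices, so map each label to its index
indices = [categories.index(c) for c in data]

# perform one-hot encoding
encoded_data = to_categorical(indices, num_classes=len(categories))

print(encoded_data)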

Output

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

5. Using TensorFlow
TensorFlow is another popular deep learning library in Python that provides many tools for machine learning. TensorFlow provides a convenient one_hot() function to perform one-hot encoding. Here's an example:

import tensorflow as tf

# create example data
categories = ['red', 'blue', 'green', 'yellow']
data = tf.constant(['red', 'blue', 'green', 'yellow'])

# look up the index of each value in categories, then one-hot encode the indices
indices = tf.argmax(tf.cast(tf.equal(tf.expand_dims(data, 1), categories), tf.int32), axis=1)
encoded_data = tf.one_hot(indices, depth=len(categories))

print(encoded_data.numpy())

Output

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

Use-cases in Machine Learning

Let's consider a few examples to better understand the one-hot encoding technique.

1. One-Hot Encoding for Colors
Suppose we have a dataset of shirts that we want to use to train a machine learning model to predict the color of a shirt. The color attribute has four categories: red, blue, green, and yellow. We can represent these categories using one-hot encoding as follows:


import numpy as np

categories = ['red', 'blue', 'green', 'yellow']

# Create an empty array to hold the one-hot encoded vectors
one_hot = np.zeros((len(categories), len(categories)), dtype=np.int32)

# Encode each category as a binary vector
for i, category in enumerate(categories):
    one_hot[i, i] = 1

# Print the resulting one-hot encoded array
print(one_hot)

Color	Red	Blue	Green	Yellow
Red	1	0	0	0
Blue	0	1	0	0
Green	0	0	1	0
Yellow	0	0	0	1
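
To apply this table to actual records, each category can be mapped to its row of the matrix. The sketch below builds such a lookup with np.eye, which produces the same matrix as the loop above; the shirts list is a hypothetical sample, not part of the original dataset description.

```python
import numpy as np

categories = ['red', 'blue', 'green', 'yellow']

# np.eye builds the same identity matrix; row i encodes categories[i]
one_hot = np.eye(len(categories), dtype=np.int32)
lookup = dict(zip(categories, one_hot))

# hypothetical shirt colors taken from a dataset
shirts = ['blue', 'blue', 'yellow']
encoded_shirts = np.array([lookup[color] for color in shirts])

print(encoded_shirts)
```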

2. One-Hot Encoding for Genres
Suppose we have a dataset of movies that we want to use to train a machine learning model to predict the genre of a movie. The genre attribute has five categories: action, comedy, drama, horror, and romance. We can represent these categories using one-hot encoding as follows:

Genre	Action	Comedy	Drama	Horror	Romance
Action	1	0	0	0	0
Comedy	0	1	0	0	0
Drama	0	0	1	0	0
Horror	0	0	0	1	0
Romance	0	0	0	0	1
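
The genre table can be produced in the same way as the color example. Here is a minimal sketch using NumPy; the movies list is a made-up sample used only to show the lookup.

```python
import numpy as np

genres = ['action', 'comedy', 'drama', 'horror', 'romance']

# hypothetical genre labels for four movies
movies = ['drama', 'action', 'romance', 'drama']

# position of each movie's genre in the genres list
indices = [genres.index(g) for g in movies]

# select the matching rows of the 5x5 identity matrix
encoded_movies = np.eye(len(genres), dtype=int)[indices]

print(encoded_movies)
```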

3. One-Hot Encoding for Letters
Suppose we have a dataset of letters that we want to use to train a machine learning model to predict the position of a letter in the alphabet. The letter attribute has 26 categories: A, B, C, ..., Z. We can represent these categories using one-hot encoding as follows:

Letter	A	B	C	...	Y	Z
A	1	0	0	...	0	0
B	0	1	0	...	0	0
C	0	0	1	...	0	0
...	...	...	...	...	...	...
Y	0	0	0	...	1	0
Z	0	0	0	...	0	1
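
For letters, the column index can be computed directly from the character code. The sketch below assumes the input contains only the letters A to Z; encode_word is a helper name introduced here for illustration.

```python
import string
import numpy as np

letters = string.ascii_uppercase  # 'ABC...XYZ'

def encode_word(word):
    # ord(ch) - ord('A') is the zero-based alphabet position of an uppercase letter
    indices = [ord(ch) - ord('A') for ch in word.upper()]
    return np.eye(len(letters), dtype=int)[indices]

encoded = encode_word('CAB')

# one row per letter, 26 columns
print(encoded.shape)

# the position of the 1 in each row is the letter's alphabet position
print(encoded.argmax(axis=1))
```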

Importance in Machine Learning

One-hot encoding is important in machine learning for several reasons:

Enables the use of categorical variables in machine learning algorithms: Many machine learning algorithms require input variables to be numerical. One-hot encoding allows us to represent categorical variables as numerical variables, which can be used as input to these algorithms.
Avoids ordering bias: When encoding categorical variables using numbers (e.g., assigning the values 1, 2, 3 to three different categories), some algorithms may assume an inherent ordering or magnitude of the categories, which may not be true. One-hot encoding removes any implicit ordering or magnitude assumptions, and treats each category as a distinct and equally important feature.
Produces sparse features: One-hot encoding increases the dimensionality of the feature space, since each category becomes its own column, and for variables with many categories this can aggravate the curse of dimensionality. However, each encoded row contains a single 1 per variable, so the result can be stored as a sparse matrix that keeps only the non-zero elements, which helps some machine learning algorithms stay efficient in memory and time.
Works across many model families: Linear models, support vector machines, and neural networks require numerical input, and many implementations of decision trees and random forests (for example, those in Scikit-learn) do as well. One-hot encoded features let all of these models use categorical variables without implying an order among the categories.
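
The ordering-bias point can be illustrated numerically. In the sketch below, three unordered categories are encoded once as the integers 0, 1, 2 and once as one-hot vectors, and the pairwise Euclidean distances are compared; pairwise_distances is a small helper written for this example.

```python
import numpy as np

# integer (label) encoding of three unordered categories
label_encoded = np.array([[0.0], [1.0], [2.0]])

# one-hot encoding of the same three categories
one_hot_encoded = np.eye(3)

def pairwise_distances(points):
    # Euclidean distance between every pair of rows
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

label_dist = pairwise_distances(label_encoded)
one_hot_dist = pairwise_distances(one_hot_encoded)

print(label_dist)
print(one_hot_dist)
```

With integer codes, the first and third categories appear twice as far apart as neighbouring ones, while with one-hot vectors every pair of distinct categories is the same distance apart.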

Overall, one-hot encoding is an important technique for preprocessing categorical data in machine learning, and is a widely used method for representing categorical variables as numerical variables that can be used as input to machine learning algorithms.

With this article at OpenGenus, you should now have a complete idea of how to implement a one-hot encoded array in Python.