Implementing Simple CNN model in PyTorch


Neural networks and deep learning are transforming artificial intelligence, automating complex tasks and enabling machines to learn from data like never before. In this OpenGenus article, we will learn how to implement a simple CNN model using the PyTorch deep learning framework.

Table of contents:

  1. Understanding AI
  2. Introduction to Machine Learning and Deep Learning
  3. Neural Networks and its components
  4. Neural Networks vs. Deep Learning
  5. Types of Neural Networks
  6. Convolutional Neural Networks (CNNs)
  7. A Simple CNN Model
  8. Conclusion

Understanding AI

Artificial intelligence is a concept of creating machines that can think, learn, and solve problems in ways that mimic human cognition. AI encompasses a broad range of technologies aimed at enabling machines to perform tasks such as understanding natural language, recognizing patterns, making decisions, and solving complex problems. It can be categorized into three major types:

1. ANI (Artificial Narrow Intelligence): This type of AI specialises in performing specific tasks once trained on data, but does not possess general cognitive abilities. Examples include voice assistants like Siri and Alexa, image recognition systems, and recommendation algorithms in online shopping. This category covers ML, Image Recognition, Robotics, Computer Vision, NLP, etc.

2. AGI (Artificial General Intelligence): AGI aims to match human intelligence and abilities across different tasks. Currently hypothetical, it represents AI systems with human-like cognitive abilities across various domains.

3. ASI (Artificial Superintelligence): ASI would surpass human intelligence across all domains and activities.

The above classifications highlight that while ANI technologies are currently dominant and advancing rapidly, AGI and ASI remain aspirational goals for future AI development.

Introduction to Machine Learning and Deep Learning

Machine learning is a subset of AI, focusing on algorithms that allow computers to learn from and make predictions based on data. Deep learning, in turn, is a subset of machine learning that employs neural networks with many layers to process data and extract features automatically. Neurons are fundamental units in neural networks, mimicking the way the human brain processes information through interconnected nodes.

Neural Networks and its components

Neural networks are layers of interconnected "neurons" that mimic how our brains process information. Each neuron processes input data and passes the result to the next layer. These networks learn from data to recognize patterns, make predictions, and solve complex problems. Some of the basic components and concepts used in these models are:

1. Neurons:
A neuron is the basic unit of a neural network. It receives input signals, computes a weighted sum of them, applies an activation function, and passes the output to the next layer of neurons. Each neuron is connected to neurons in the previous and next layers, forming a network.

2. Activation Functions:
Activation functions determine the output of a neuron given its input. They introduce non-linearity into the network, enabling it to learn complex patterns. Common activation functions include ReLU (Rectified Linear Unit), Sigmoid, and Tanh.

3. Layers:
Neural networks are organized into layers of neurons. There are typically three types of layers:
Input Layer: Passes input data to the hidden layers.
Hidden Layers: These are intermediate layers between input and output layers, which perform computations and learn patterns in the given data.
Output Layer: Produces the final output based on the computations of the hidden layers.

4. Weights and Biases:
Each connection between neurons has a weight that signifies its importance. During training, these weights are adjusted to minimize the difference between predicted and actual outputs.
Each neuron also has a bias term that allows the network to account for input variability not captured by the weights.

5. Optimization Algorithm:
Optimization algorithms adjust the weights and biases of the neural network during training to minimize the loss function, updating the parameters in the direction that reduces the error between predicted and actual outputs.

6. Forward and Backward propagation:
In forward propagation, the data flows from the input layer through the hidden layers to the output layer, where predictions are made. The primary goal of forward propagation is to generate predictions or outputs based on the current weights and biases of the neural network.
Backpropagation computes the gradient of the loss function with respect to every weight and bias by propagating the error backwards through the network. The optimization algorithm then uses these gradients to update the parameters so that the network learns to make more accurate predictions over time.

Both processes are essential for training neural networks, including CNNs, effectively on tasks such as image classification, object detection, and image segmentation.

7. Learning Rate:
The learning rate controls how much the weights are adjusted with respect to the loss gradient during training. It's a critical hyperparameter that affects how quickly or slowly the network learns.

8. Feature maps:
A feature map is the output produced by applying a filter to an input. It works like a "highlight" sheet that shows where and how strongly certain features, such as edges, shapes, or textures, are present in the input image.
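To make the components above concrete, here is a minimal sketch of a single neuron in plain Python. It is an illustration only: the squared-error loss, the input values, and the helper names (`neuron_forward`, `train_step`) are assumptions chosen for this example and are not part of the CNN built later in the article.

```python
import math

def relu(z):
    """ReLU activation: passes positive values, clamps negatives to 0."""
    return max(0.0, z)

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(inputs, weights, bias):
    """Forward propagation: weighted sum plus bias, then activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

def train_step(inputs, target, weights, bias, learning_rate):
    """One gradient-descent step on the loss 0.5 * (y - target) ** 2."""
    y = neuron_forward(inputs, weights, bias)
    # Backward propagation (chain rule): dL/dz = (y - target) * sigmoid'(z),
    # where sigmoid'(z) = y * (1 - y).
    delta = (y - target) * y * (1.0 - y)
    # The learning rate scales how far each parameter moves along its gradient.
    new_weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * delta
    return new_weights, new_bias

# Hypothetical data: one training example with two input features.
inputs, target = [0.5, -1.0], 1.0
weights, bias = [0.1, 0.2], 0.0

loss_before = 0.5 * (neuron_forward(inputs, weights, bias) - target) ** 2
for _ in range(100):
    weights, bias = train_step(inputs, target, weights, bias, learning_rate=0.5)
loss_after = 0.5 * (neuron_forward(inputs, weights, bias) - target) ** 2

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

A larger learning rate moves the parameters further per step; PyTorch automates the gradient computation shown here through `loss.backward()`.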

Neural Networks vs. Deep Learning

Neural networks are the building blocks of deep learning models. They consist of interconnected nodes arranged in layers, where each node performs a simple computation and passes the result to nodes in the next layer. Deep learning extends this concept by using neural networks with many layers (deep neural networks) to learn hierarchical representations of data.

Types of Neural Networks

Neural networks come in various architectures suited for different types of data and tasks:
Supervised learning: Learns from labeled data to predict or classify new data. Artificial Neural Networks (ANN), Feedforward Neural Networks (FNN) and Convolutional Neural Networks (CNN) are some common examples.

Unsupervised learning: Learns patterns and relationships in data without explicit labels. Autoencoders and Generative Adversarial Networks (GAN) are some examples.

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are models, typically trained with supervised learning, that are particularly effective for image and video analysis. They learn to recognize patterns such as edges, textures, and shapes through a series of hierarchical layers.
Let's delve deeper into Convolutional Neural Networks (CNNs):

1. Architecture of CNNs:
i. Convolutional Layers: These layers are the fundamental building blocks of CNNs and are used to extract important features such as edges, textures, and shapes from the input data. A convolutional layer involves the following:
* Convolution operation: involves taking a small matrix called a kernel (or filter) and sliding it over the input data (usually an image), multiplying the kernel element-wise with the region it covers and summing the results to produce a single output value in the feature map. Each filter extracts specific features from the input, such as edges, textures, or more complex patterns.
* Parameters: Parameters include the size of the filter (kernel size), the number of filters, and the stride (step size) with which the filter moves across the input.
* Activation Function: Typically, ReLU (Rectified Linear Unit) is used as the activation function after convolution operations.
ii. Pooling Layers: These layers downsample the feature maps obtained from convolutional layers, reducing their width and height. Max pooling and average pooling are common methods used to retain the most important features while reducing computational complexity.
Pooling size (kernel size) and stride determine how pooling is applied across the input.
iii. Fully Connected Layers (Dense Layers): These layers integrate features learned by convolutional and pooling layers into predictions. They connect every neuron in one layer to every neuron in the next layer, producing the final output.
Each connection between neurons has its own weight and bias, which are adjusted during training to improve model accuracy.
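As an illustration of the pooling step described above, the following is a small sketch of max pooling in plain Python. The function name `max_pool2d` and the 4x4 feature map are assumptions made up for this example; PyTorch provides the same operation as nn.MaxPool2d.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Downsample a 2D list by keeping the maximum of each size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [7, 2, 9, 4],
    [0, 1, 3, 5],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 2], [7, 9]]
```

With a 2x2 window and a stride of 2, the width and height are both halved, so the 4x4 map becomes 2x2 while the strongest response in each region is kept.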

2. Example of working of CNN model :
The goal is to detect horizontal edges in a 3x3 image using a small filter.

Input Image: Consider a grayscale 3x3 image matrix (pixel values represented as integers):

[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]

Filter (Kernel): Let's define a simple 2x2 filter that detects horizontal edges. This filter has positive weights on the top row and negative weights on the bottom row, so it responds to a change in brightness (light to dark, or dark to light) between one row of pixels and the row below it.

[[1, 1],
 [-1, -1]]

Applying the Filter: Slide the filter over the input image, computing the dot product at each position. Start with the top-left 2x2 sub-region of the image.

[[1, 2],
 [4, 5]]

Multiply each pixel by the corresponding filter weight and sum the results:

(1 * 1) + (2 * 1) + (4 * -1) + (5 * -1) = 1 + 2 - 4 - 5 = -6

Store the result (-6 in this case) as the top-left value of the output feature map.

Sliding: Slide the filter to the right by one pixel and compute again.

[[2, 3],
 [5, 6]]

(2 * 1) + (3 * 1) + (5 * -1) + (6 * -1) = 2 + 3 - 5 - 6 = -6

Output Feature Map: After sliding the filter across the entire image (two positions horizontally and two vertically), the resulting 2x2 output feature map is:

[[-6, -6],
 [-6, -6]]

Each value in the output feature map represents the response of the filter to the corresponding sub-region of the input image.
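The worked example above can be reproduced with a short sketch in plain Python. The function name `conv2d_valid` and the second test image `edge_image` are assumptions introduced here for illustration. Like the convolution layers in deep learning frameworks, the sketch slides the filter without flipping it (strictly speaking, a cross-correlation).

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide kernel over image with no padding and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = (len(image) - k_rows) // stride + 1
    out_cols = (len(image[0]) - k_cols) // stride + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            # Element-wise multiply the window with the kernel and sum.
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 1],
          [-1, -1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[-6, -6], [-6, -6]]

# A hypothetical image with a sharp horizontal edge between rows 0 and 1.
edge_image = [[0, 0, 0],
              [9, 9, 9],
              [9, 9, 9]]
edge_map = conv2d_valid(edge_image, kernel)
print(edge_map)  # [[-18, -18], [0, 0]]
```

In the first image every row differs from the next by the same amount, so the response is uniform. In the second image the filter responds strongly only where the dark row meets the bright row and gives 0 where both rows are equal.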

A Simple CNN Model

Here's a basic example of how to implement a simple CNN model using PyTorch:

1. Import the necessary libraries :

  • import torch is the core library of PyTorch, providing foundational tensor operations and neural network functionalities.
  • import numpy as np is utilized for numerical operations and managing arrays, integral to data manipulation and preprocessing.
  • import torchvision and import torchvision.transforms as transforms are used for working with image datasets, pre-trained models, and image preprocessing techniques, aiding in tasks related to computer vision.
import time
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import random_split, DataLoader
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import torch.nn.functional as F

2. Set the parameters :
The below parameters are critical for configuring the training and validation process of a machine learning model.

  • batch_size specifies the number of samples processed in each training iteration and helps balance training speed against memory usage.
  • valid_size specifies the percentage of the dataset that will be reserved for validation, allowing the model to be evaluated on unseen data during training to monitor performance and prevent overfitting.
  • num_epochs sets the number of times the entire dataset will be passed through the model during training.
  • num_workers specifies the number of workers to use for data loading and can be adjusted for performance optimization in more complex scenarios.
batch_size = 64
valid_size = 0.2
num_epochs = 20
num_workers = 4

3. Prepare the dataset :
transform is used for preparing images before they are used to train a machine learning model.

  • transforms.ToTensor() converts the image from a regular picture format into a structured numerical format (tensor) that the machine learning model can work with, scaling pixel values to the range [0, 1].
  • transforms.Normalize() standardizes each color channel by subtracting a mean and dividing by a standard deviation. With a mean and standard deviation of 0.5 per channel, as used here, values in [0, 1] are mapped to [-1, 1], which helps the model learn better and more consistently.
  • The steps make sure that all images are in a format and range that the model can work with effectively.
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
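The arithmetic behind these two transforms can be sketched in plain Python for a single pixel value. The helper names `to_unit_range` and `normalize` are hypothetical and only mirror what ToTensor() and Normalize() do per value for 8-bit images; the real transforms operate on whole tensors.

```python
def to_unit_range(pixel):
    """Mimic ToTensor() scaling for an 8-bit pixel: [0, 255] -> [0.0, 1.0]."""
    return pixel / 255.0

def normalize(value, mean=0.5, std=0.5):
    """Mimic Normalize() for one channel value: (value - mean) / std."""
    return (value - mean) / std

for pixel in (0, 128, 255):
    scaled = to_unit_range(pixel)
    print(f"pixel {pixel:3d} -> scaled {scaled:.3f} -> normalized {normalize(scaled):+.3f}")
```

The darkest pixel maps to -1 and the brightest to +1, so the inputs to the network are centered around zero.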

4. Load and split dataset :
Using the CIFAR10 dataset, let's download and load the dataset, which contains images of different classes for training and testing purposes.

  • full_train_data holds the training images and labels, while test_data contains the test images and labels.
  • valid_size parameter specifies that 20% of the training data should be used for validation. This percentage is calculated and used to split full_train_data into train_data and valid_data using random_split, ensuring that 80% of the data is used for training and 20% for validation.
  • DataLoader class is then used to create loaders for each dataset: train_loader for training the model with shuffled batches, valid_loader for validating the model with non-shuffled batches, and test_loader for evaluating the model using the entire test dataset.
full_train_data = torchvision.datasets.CIFAR10('data', train=True, download=True, transform=transform)
test_data = torchvision.datasets.CIFAR10('data', train=False, download=True, transform=transform)

num_train = len(full_train_data)
# store the validation set size as 20% of the training dataset
split = int(np.floor(valid_size * num_train))
train_size = num_train - split # store the train_dataset size (80% in our case)

# Random split of the dataset
train_data, valid_data = random_split(full_train_data, [train_size, split])

# train the model using 80% of the training dataset
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, num_workers=num_workers)
# validate the model using the remaining 20% of the training dataset
valid_loader = DataLoader(valid_data, batch_size=batch_size, shuffle=False, num_workers=num_workers)
# run the test using the entire test dataset
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False, num_workers=num_workers)
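As a quick sanity check on the split, the sizes and the number of batches per epoch can be worked out by hand. The sketch below assumes the standard CIFAR-10 sizes (50,000 training images and 10,000 test images) and the DataLoader default of keeping the last, smaller batch.

```python
import math

num_train = 50000   # CIFAR-10 training images (assumed standard size)
num_test = 10000    # CIFAR-10 test images (assumed standard size)
valid_size = 0.2
batch_size = 64

split = int(math.floor(valid_size * num_train))  # validation samples
train_size = num_train - split                   # training samples

# The last batch may be smaller than batch_size, hence the ceiling.
train_batches = math.ceil(train_size / batch_size)
valid_batches = math.ceil(split / batch_size)
test_batches = math.ceil(num_test / batch_size)

print(f"train: {train_size} samples, {train_batches} batches")
print(f"valid: {split} samples, {valid_batches} batches")
print(f"test:  {num_test} samples, {test_batches} batches")
```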

5. Define the CNN model :
There are several ways to optimize and customize the CNN model, based on the test accuracy and requirements. Let's look at a sample model and understand the various layers.

  • classes lists the ten categories of the CIFAR-10 dataset in label order, such as 'plane', 'bird', 'cat', etc. (the dataset's 'automobile' category, label 1, appears in this list as 'vehicle').
  • Net defines a Convolutional Neural Network (CNN) model using PyTorch's nn.Module.
  • The Net model consists of:
    i. Convolutional Layers: Three sequential blocks of layers extract features from the input images. In each block, Conv2d applies the convolution operation, BatchNorm2d applies batch normalization, and ReLU serves as the activation function.
    ii. Pooling Layers: Used to reduce the spatial dimensions of the feature maps, helping to make the model more computationally efficient and less sensitive to small changes in the input.
    iii. Fully Connected Layers: These layers are used to make final predictions based on the features extracted by the convolutional layers. The 128 feature maps of size 4x4 are flattened into a vector of 2048 values, which is reduced to 256, then 128, and finally 10 outputs: one raw score (logit) for each possible category, like 'cat', 'dog', or 'plane'. A higher score means the network considers that category more likely.
    iv. Dropout Layers: These layers ensure that the network learns to work well even when some parts are missing, which helps it to generalize better and not just memorize the training data.
    v. forward method: specifies how input data flows through the network: it applies convolution, pooling, and activation functions, reshapes the data, and passes it through fully connected layers with dropout applied at various stages.
  • Finally, an instance of the Net class is created with net = Net(), initializing the CNN model.
classes = ['plane', 'vehicle', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

# Define the CNN model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)
        )
        self.fc1 = nn.Linear(128 * 4 * 4, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = x.view(-1, 128 * 4 * 4)
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        x = self.fc3(x)
        return x

net = Net()
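The value 128 * 4 * 4 used in fc1 follows from the layer settings. The sketch below traces the feature map size through the three blocks in plain Python, using the standard output-size formula for convolution and pooling; the helper names `conv_out` and `pool_out` are assumptions for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel, stride):
    """Output width/height of a pooling layer along one dimension."""
    return (size - kernel) // stride + 1

size = 32                      # CIFAR-10 images are 32x32 with 3 channels
shapes = [(3, size, size)]
for out_channels in (32, 64, 128):
    size = conv_out(size, kernel=3, padding=1)   # 3x3 conv with padding=1 keeps the size
    size = pool_out(size, kernel=2, stride=2)    # 2x2 max pooling halves the size
    shapes.append((out_channels, size, size))

channels, height, width = shapes[-1]
flattened = channels * height * width

for shape in shapes:
    print(shape)
print("flattened features:", flattened)  # 2048
```

Each block keeps the spatial size during convolution and halves it during pooling (32 to 16 to 8 to 4), so the flattened input to fc1 has 128 * 4 * 4 = 2048 features.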

6. Define loss function and optimizer:
The criterion is used to measure how good or bad the network's predictions are, and optimizer is used to adjust the network's weights to improve its performance based on the measurement.

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)
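nn.CrossEntropyLoss expects the raw scores (logits) from fc3 and applies a softmax internally, which is why the model has no softmax layer of its own. The following plain-Python sketch shows the computation for a single sample; the function names and the example logits are assumptions for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])

# A model that scores all 10 classes equally is effectively guessing.
uniform_loss = cross_entropy([0.0] * 10, target=3)
print(f"uniform guess loss: {uniform_loss:.4f}")   # ln(10), about 2.3026

# Hypothetical logits where class 3 has the highest score.
confident_loss = cross_entropy([0.1, 0.2, 0.0, 4.0, 0.3, 0.1, 0.0, 0.2, 0.1, 0.0], target=3)
print(f"confident correct loss: {confident_loss:.4f}")
```

A loss close to ln(10) therefore indicates a 10-class model that has not learned anything yet, and the loss approaches 0 as the correct class receives more of the probability.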

7. Training, validation and loss calculation:

  • valid_loss_min is a variable to keep track of the lowest validation loss seen so far.
  • The training and validation phases are repeated once per epoch.
  • The training phase:
    i. The network is put into training mode net.train(), and for each batch of data, we clear old gradients optimizer.zero_grad().
    ii. Make predictions net(data) and calculate the loss criterion(output, target).
    iii. Compute the gradients of the loss loss.backward(), adjust the weights to reduce the loss optimizer.step(), and accumulate the total training loss loss.item() * data.size(0).
  • The Validation Phase:
    i. The network is switched to evaluation mode net.eval().
    ii. For each batch of validation data, we calculate the loss without tracking gradients or updating the weights, using torch.no_grad(). net(data) gets the network’s predictions for the validation data, and criterion(output, target) calculates how wrong the predictions are for this batch.
    iii. Accumulate the total validation loss loss.item() * data.size(0).
  • Calculate average loss:
    i. After each epoch, the code calculates the average loss for both training and validation data by dividing the total loss by the number of samples in each dataset.
    ii. Measure the time taken for each epoch. Additionally, check whether the validation loss has improved compared to previous epochs and, if so, save the model’s state to the file net_cifar10.pt.
    This process helps track the model's performance, monitor progress, and ensure the best-performing model is saved.
valid_loss_min = np.inf  # np.Inf was removed in NumPy 2.0

for epoch in range(num_epochs):
    start_time = time.time()
    train_loss = 0.0
    valid_loss = 0.0
    
    # Training
    net.train()
    for data, target in train_loader:
        optimizer.zero_grad()
        output = net(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * data.size(0)
    # Validation
    net.eval()
    with torch.no_grad():
        for data, target in valid_loader:
            output = net(data)
            loss = criterion(output, target)
            valid_loss += loss.item() * data.size(0)
    
    # Calculate average loss
    train_loss /= len(train_loader.dataset)
    valid_loss /= len(valid_loader.dataset)

    end_time = time.time()
    epoch_time = end_time - start_time

    print(f'Epoch: {epoch+1}/{num_epochs} | Time: {epoch_time:.3f}s | Training Loss: {train_loss:.4f} | Validation Loss: {valid_loss:.4f}')

    # Save model if validation loss decreases
    if valid_loss <= valid_loss_min:
        print(f'Validation loss decreased ({valid_loss_min:.4f} --> {valid_loss:.4f}). Saving model as net_cifar10.pt')
        torch.save(net.state_dict(), 'net_cifar10.pt')
        valid_loss_min = valid_loss
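The loop multiplies each batch loss by data.size(0) before summing because the criterion returns the mean loss of a batch (its default reduction), and the last batch can be smaller than the others. The sketch below, with made-up batch losses and sizes, shows how this weighting gives the true per-sample average, while a plain average over batches does not.

```python
# Hypothetical (mean batch loss, batch size) pairs; the last batch is smaller.
batch_results = [(0.9, 64), (0.7, 64), (0.2, 16)]

running_loss = 0.0
num_samples = 0
for mean_loss, size in batch_results:
    running_loss += mean_loss * size   # same idea as loss.item() * data.size(0)
    num_samples += size

per_sample_loss = running_loss / num_samples
naive_loss = sum(mean_loss for mean_loss, _ in batch_results) / len(batch_results)

print(f"per-sample average: {per_sample_loss:.4f}")
print(f"average of batch means: {naive_loss:.4f}")
```

The two numbers differ because the naive average gives the 16-sample batch the same weight as the 64-sample batches.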

8. Load and test the best model:

  • net.load_state_dict(torch.load('net_cifar10.pt')) loads the best saved model from a file named net_cifar10.pt.
  • Initialize variables test_loss, class_correct, and class_total to keep track of the total loss on the test data and how well the model performs overall.
  • The Testing Phase:
    i. The model is set to evaluation mode with net.eval(), and torch.no_grad() is again used to stop the model from calculating gradients, which saves memory.
    ii. For each batch of test data, the model makes predictions, calculates the loss, and updates the total test loss.
    iii. It then checks if each prediction is correct and keeps a count of correct predictions for each class and the total number of samples for each class.
  • Print the test results: the code calculates the average loss and the overall accuracy on the test dataset and prints how well the model performs overall.
# Load the best model
net.load_state_dict(torch.load('net_cifar10.pt'))
print('Finished Training')

test_loss = 0.0
class_correct = [0] * 10
class_total = [0] * 10

net.eval()
with torch.no_grad():
    for data, target in test_loader:
        output = net(data)
        loss = criterion(output, target)
        test_loss += loss.item() * data.size(0)
        _, pred = torch.max(output, 1)
        correct = pred.eq(target.view_as(pred))
        for i in range(len(target)):
            label = target[i].item()
            class_correct[label] += correct[i].item()
            class_total[label] += 1
# Print test results
test_loss /= len(test_loader.dataset)
print(f'Test Loss: {test_loss:.6f}')

overall_accuracy = 100. * np.sum(class_correct) / np.sum(class_total)
print(f'\nTest Accuracy (Overall): {overall_accuracy:.2f}%')
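The class_correct and class_total lists also allow a per-class breakdown, which the code above collects but does not print. A minimal sketch of such a report follows; the function name `per_class_accuracy` and the counts are hypothetical and are not results from this model.

```python
def per_class_accuracy(classes, class_correct, class_total):
    """Return {class name: accuracy in percent, or None if the class has no samples}."""
    report = {}
    for name, correct, total in zip(classes, class_correct, class_total):
        report[name] = 100.0 * correct / total if total > 0 else None
    return report

# Hypothetical counts for three classes, for illustration only.
sample_classes = ['plane', 'cat', 'dog', 'frog']
sample_correct = [8, 9, 6, 0]
sample_total = [10, 10, 8, 0]

report = per_class_accuracy(sample_classes, sample_correct, sample_total)
for name, accuracy in report.items():
    if accuracy is None:
        print(f'Test Accuracy of {name}: N/A (no samples)')
    else:
        print(f'Test Accuracy of {name}: {accuracy:.2f}%')
```

Calling the function with the article's classes, class_correct and class_total after the test loop would show which categories the model confuses most.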

After downloading the CIFAR-10 dataset, the model is trained on 80% of the training images, validated on the remaining 20%, and evaluated on the separate CIFAR-10 test set. A sample run produces the following output:

Epoch: 1/20 | Time: 218.234s | Training Loss: 1.6360 | Validation Loss: 1.2467
Validation loss decreased (inf --> 1.2467). Saving model as net_cifar10.pt
Epoch: 2/20 | Time: 223.496s | Training Loss: 1.2740 | Validation Loss: 1.0825
Validation loss decreased (1.2467 --> 1.0825). Saving model as net_cifar10.pt
Epoch: 3/20 | Time: 232.533s | Training Loss: 1.1203 | Validation Loss: 0.9540
Validation loss decreased (1.0825 --> 0.9540). Saving model as net_cifar10.pt
Epoch: 4/20 | Time: 230.899s | Training Loss: 1.0092 | Validation Loss: 0.8614
Validation loss decreased (0.9540 --> 0.8614). Saving model as net_cifar10.pt
Epoch: 5/20 | Time: 231.082s | Training Loss: 0.9349 | Validation Loss: 0.8214
Validation loss decreased (0.8614 --> 0.8214). Saving model as net_cifar10.pt
Epoch: 6/20 | Time: 252.445s | Training Loss: 0.8686 | Validation Loss: 0.8234
Epoch: 7/20 | Time: 234.719s | Training Loss: 0.8168 | Validation Loss: 0.7961
Validation loss decreased (0.8214 --> 0.7961). Saving model as net_cifar10.pt
Epoch: 8/20 | Time: 244.801s | Training Loss: 0.7701 | Validation Loss: 0.7754
Validation loss decreased (0.7961 --> 0.7754). Saving model as net_cifar10.pt
Epoch: 9/20 | Time: 284.708s | Training Loss: 0.7218 | Validation Loss: 0.7546
Validation loss decreased (0.7754 --> 0.7546). Saving model as net_cifar10.pt
Epoch: 10/20 | Time: 255.791s | Training Loss: 0.6918 | Validation Loss: 0.7677
Epoch: 11/20 | Time: 203.933s | Training Loss: 0.6485 | Validation Loss: 0.7009
Validation loss decreased (0.7546 --> 0.7009). Saving model as net_cifar10.pt
Epoch: 12/20 | Time: 393.549s | Training Loss: 0.6176 | Validation Loss: 0.7026
Epoch: 13/20 | Time: 253.282s | Training Loss: 0.5890 | Validation Loss: 0.6831
Validation loss decreased (0.7009 --> 0.6831). Saving model as net_cifar10.pt
Epoch: 14/20 | Time: 284.252s | Training Loss: 0.5553 | Validation Loss: 0.6826
Validation loss decreased (0.6831 --> 0.6826). Saving model as net_cifar10.pt
Epoch: 15/20 | Time: 229.772s | Training Loss: 0.5271 | Validation Loss: 0.6881
Epoch: 16/20 | Time: 257.720s | Training Loss: 0.5061 | Validation Loss: 0.6940
Epoch: 17/20 | Time: 271.851s | Training Loss: 0.4801 | Validation Loss: 0.7251
Epoch: 18/20 | Time: 240.566s | Training Loss: 0.4522 | Validation Loss: 0.6837
Epoch: 19/20 | Time: 243.856s | Training Loss: 0.4357 | Validation Loss: 0.6817
Validation loss decreased (0.6826 --> 0.6817). Saving model as net_cifar10.pt
Epoch: 20/20 | Time: 278.209s | Training Loss: 0.4215 | Validation Loss: 0.7156
Finished Training
Test Loss: 0.712707

Test Accuracy (Overall): 78.11%
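The checkpointing behaviour visible in this log can be replayed with a short sketch. The validation losses below are copied from the output above; the variable names are chosen for this example.

```python
# Validation loss per epoch, copied from the training log above.
valid_losses = [
    1.2467, 1.0825, 0.9540, 0.8614, 0.8214, 0.8234, 0.7961, 0.7754, 0.7546, 0.7677,
    0.7009, 0.7026, 0.6831, 0.6826, 0.6881, 0.6940, 0.7251, 0.6837, 0.6817, 0.7156,
]

best_loss = float("inf")
saved_epochs = []
for epoch, loss in enumerate(valid_losses, start=1):
    if loss <= best_loss:          # same condition as in the training loop
        best_loss = loss
        saved_epochs.append(epoch)

best_epoch = saved_epochs[-1]
print("checkpoint saved after epochs:", saved_epochs)
print(f"best validation loss {best_loss:.4f} at epoch {best_epoch}")
```

The model loaded for testing is therefore the one from epoch 19, not the final epoch. After roughly epoch 14 the training loss keeps falling while the validation loss stays near 0.68 to 0.72, which suggests the model is starting to overfit.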

Conclusion

In conclusion, this article provides a foundational approach to implementing a basic CNN model using PyTorch. Such a model is generally used for image classification tasks and can be extended with more complex architectures and techniques to tackle a variety of problems in deep learning. CNNs are pivotal in handling visual data, and implementing them in frameworks like PyTorch allows efficient development of sophisticated machine learning models.
As AI continues to advance, these technologies will play an increasingly crucial role in shaping the future of technology and society.
