Understanding the VGG19 Architecture


VGG19 is a variant of VGG model which in short consists of 19 layers (16 convolution layers, 3 Fully connected layer, 5 MaxPool layers and 1 SoftMax layer). There are other variants of VGG like VGG11, VGG16 and others. VGG19 has 19.6 billion FLOPs.

Background

AlexNet came out in 2012 and it improved on the traditional Convolutional neural networks, So we can understand VGG as a successor of the AlexNet but it was created by a different group named as Visual Geometry Group at Oxford's and hence the name VGG, It carries and uses some ideas from it's predecessors and improves on them and uses deep Convolutional neural layers to improve accuracy.

Let's explore what VGG19 is and compare it with some of other versions of the VGG architecture and also see some useful and practical applications of the VGG architecture.

Before diving in and looking at what VGG19 Architecture is let's take a look at ImageNet and a basic knowledge of CNN.

First of all lets explore what ImageNet is. It is an Image database consisting of 14,197,122 images organized according to the WordNet hierarchy. this is a initiative to help researchers, students and others in the field of image and vision research.

ImageNet also hosts contests from which one was ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) which challenged researchers around the world to come up with solutions that yields the lowest top-1 and top-5 error rates (top-5 error rate would be the percent of images where the correct label is not one of the model’s five most likely labels). The competition gives out a 1,000 class training set of 1.2 million images, a validation set of 50 thousand images and a test set of 150 thousand images.

and here comes the VGG Architecture, in 2014 it out-shined other state of the art models and is still preferred for a lot of challenging problems.

VGG19

So in simple language VGG is a deep CNN used to classify images. The layers in VGG19 model are as follows:

  • Conv3x3 (64)
  • Conv3x3 (64)
  • MaxPool
  • Conv3x3 (128)
  • Conv3x3 (128)
  • MaxPool
  • Conv3x3 (256)
  • Conv3x3 (256)
  • Conv3x3 (256)
  • Conv3x3 (256)
  • MaxPool
  • Conv3x3 (512)
  • Conv3x3 (512)
  • Conv3x3 (512)
  • Conv3x3 (512)
  • MaxPool
  • Conv3x3 (512)
  • Conv3x3 (512)
  • Conv3x3 (512)
  • Conv3x3 (512)
  • MaxPool
  • Fully Connected (4096)
  • Fully Connected (4096)
  • Fully Connected (1000)
  • SoftMax

Architecture

  • A fixed size of (224 * 224) RGB image was given as input to this network which means that the matrix was of shape (224,224,3).
  • The only preprocessing that was done is that they subtracted the mean RGB value from each pixel, computed over the whole training set.
  • Used kernels of (3 * 3) size with a stride size of 1 pixel, this enabled them to cover the whole notion of the image.
  • spatial padding was used to preserve the spatial resolution of the image.
  • max pooling was performed over a 2 * 2 pixel windows with sride 2.
  • this was followed by Rectified linear unit(ReLu) to introduce non-linearity to make the model classify better and to improve computational time as the previous models used tanh or sigmoid functions this proved much better than those.
  • implemented three fully connected layers from which first two were of size 4096 and after that a layer with 1000 channels for 1000-way ILSVRC classification and the final layer is a softmax function.

The column E in the following table is for VGG19 (other columns are for other variants of VGG models):

Configuration table from the actual paper

Table 1 : Actual configuration of the networks, the ReLu layers are not shown for the sake of brevity.

Comparison and explanation of the columns (Architecture) in Table 1

so let me first explain the column E as that is the VGG19 architecture, it contained 16 layers of CNNs and three fully connected layers and a final layer for softmax function, the fully connected layers and the final layer are going to remain the same for all the network architectures.

  • A : Contains 8 CNN layers so total of 11 layers including the fully connected(FC) layers and and has no difference internally except the number of layers.
  • A-LRN : This is also similar to the column A but has one extra step of Local response normalization(LRN) which implements lateral inhibition in the layer by which i mean that it makes a significant peak and thus creating a local maxima which increases the sensory perception which we may want in our CNN but it was seen that for this specific case that is ILSVRC it wasn't increasing accuracy and the overall network was taking more time to train.
  • C : This contains 13 CNN layers and 16 including the FC layers, In this architecture authors have used a conv filter of (1 * 1) just to introduce non-linearity and thus better discrimination.
  • B and D : These columns just add extra CNN layers and are of 13 layers and 16 layers respectively.

Results

Comparison between other state of the art models presented at ILSVRC.

Comparison Result

By this we can see that deep CNNs can achieve really better precision by implementing more layers and can reduce computation by implementing layers of 3 * 3 conv filter rather than just a 7 * 7 filter which eventually reduced the number of parameters and hence the computational time.

Uses of the VGG Neural Network

The main purpose for which the VGG net was designed was to win the ILSVRC but it has been used in many other ways.

  • Used just as a good classification architecture for many other datasets and as the authors made the models available to the public they can be used as is or with modification for other similar tasks also.
  • Transfer learning : can be used for facial recognition tasks also.
  • weights are easily available with other frameworks like keras so they can be tinkered with and used for as one wants.
  • Content and style loss using VGG-19 network

Resources

For the Implementational details and for deep study refer to the original paper.