SSD MobileNetV1 architecture

MobileNet is one of the many deep convolutional models available to us. In this article, we dive into what MobileNet is, what makes it special among other convolutional neural network architectures, what the Single-Shot multibox Detector (SSD) is, how MobileNet V1 SSD came into being, and its architecture.

Table of Contents:

  • MobileNet
    • MobileNet V1 architecture
  • Single-Shot multibox Detector
  • SSD MobileNet V1 architecture

MobileNet

MobileNet is a convolutional neural network (CNN) architecture designed for image classification in mobile applications. Rather than standard convolution layers, it uses depthwise separable convolution layers. What makes this model stand out is that its architecture reduces the computational cost, so very little computational power is needed to run it or to apply transfer learning.
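To make the cost claim concrete, here is a minimal sketch that counts multiply-adds for a standard convolution and for its depthwise separable replacement. The function names and the example layer (3 x 3 kernel, 512 to 512 channels, 14 x 14 feature map) are assumptions chosen for illustration; only the counting formulas matter.

```python
def standard_conv_cost(kernel, in_ch, out_ch, fmap):
    """Multiply-adds of a standard convolution producing a fmap x fmap output."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap


def separable_conv_cost(kernel, in_ch, out_ch, fmap):
    """Multiply-adds of a depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = kernel * kernel * in_ch * fmap * fmap  # one filter per channel
    pointwise = in_ch * out_ch * fmap * fmap           # 1 x 1 channel mixing
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 512 -> 512 channels, 14 x 14 feature map.
std = standard_conv_cost(3, 512, 512, 14)
sep = separable_conv_cost(3, 512, 512, 14)
ratio = sep / std  # equals 1/out_ch + 1/kernel**2
print(f"standard: {std:,}  separable: {sep:,}  ratio: {ratio:.3f}")
```

For this layer the ratio is about 0.113, so the separable version needs roughly 8 to 9 times fewer multiply-adds than the standard convolution.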

MobileNet V1 architecture

MobileNet V1 is the first version of the MobileNet model.

MobileNet-V1-1

The above image depicts the depthwise separable convolution. In MobileNet V1, the convolution block in the image, which consists of a depthwise convolution followed by a pointwise convolution, is repeated 13 times after the initial convolution layer. The table below gives its detailed architecture.
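The two steps of the block can be sketched in plain Python on nested lists. This is an illustrative toy, not how a framework implements it: it assumes stride 1 and no padding, omits the batch normalization and ReLU that follow each convolution in MobileNet, and the function names are made up for this example.

```python
def depthwise_conv(x, kernels):
    """Apply one k x k kernel per input channel (stride 1, no padding).

    x: list of channels, each an H x W nested list.
    kernels: one k x k nested list per channel.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        out.append([
            [sum(channel[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)
        ])
    return out


def pointwise_conv(x, weights):
    """1 x 1 convolution: mix the channels at every pixel.

    weights[o][c] is the weight from input channel c to output channel o.
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(w_oc * x[c][i][j] for c, w_oc in enumerate(row))
          for j in range(w)]
         for i in range(h)]
        for row in weights
    ]


# Toy input: 2 channels of 4 x 4 (all ones, all twos).
x = [[[1] * 4 for _ in range(4)], [[2] * 4 for _ in range(4)]]
dw_kernels = [[[1] * 3 for _ in range(3)] for _ in range(2)]  # 3 x 3 box filters
pw_weights = [[1, 1], [1, -1], [0, 1]]                        # 2 -> 3 channels

y = pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)
print(len(y), len(y[0]), len(y[0][0]))  # channels, height, width
```

The depthwise step never mixes channels; only the pointwise step does, which is why the number of output channels is set entirely by the 1 x 1 convolution.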

Type/Stride | Filter shape        | Input size
Conv/s2     | 3 x 3 x 3 x 32      | 224 x 224 x 3
Conv dw/s1  | 3 x 3 x 32 dw       | 112 x 112 x 32
Conv/s1     | 1 x 1 x 32 x 64     | 112 x 112 x 32
Conv dw/s2  | 3 x 3 x 64 dw       | 112 x 112 x 64
Conv/s1     | 1 x 1 x 64 x 128    | 56 x 56 x 64
Conv dw/s1  | 3 x 3 x 128 dw      | 56 x 56 x 128
Conv/s1     | 1 x 1 x 128 x 128   | 56 x 56 x 128
Conv dw/s2  | 3 x 3 x 128 dw      | 56 x 56 x 128
Conv/s1     | 1 x 1 x 128 x 256   | 28 x 28 x 128
Conv dw/s1  | 3 x 3 x 256 dw      | 28 x 28 x 256
Conv/s1     | 1 x 1 x 256 x 256   | 28 x 28 x 256
Conv dw/s2  | 3 x 3 x 256 dw      | 28 x 28 x 256
Conv/s1     | 1 x 1 x 256 x 512   | 14 x 14 x 256
Conv dw/s1  | 3 x 3 x 512 dw      | 14 x 14 x 512
Conv/s1     | 1 x 1 x 512 x 512   | 14 x 14 x 512
Conv dw/s1  | 3 x 3 x 512 dw      | 14 x 14 x 512
Conv/s1     | 1 x 1 x 512 x 512   | 14 x 14 x 512
Conv dw/s1  | 3 x 3 x 512 dw      | 14 x 14 x 512
Conv/s1     | 1 x 1 x 512 x 512   | 14 x 14 x 512
Conv dw/s1  | 3 x 3 x 512 dw      | 14 x 14 x 512
Conv/s1     | 1 x 1 x 512 x 512   | 14 x 14 x 512
Conv dw/s1  | 3 x 3 x 512 dw      | 14 x 14 x 512
Conv/s1     | 1 x 1 x 512 x 512   | 14 x 14 x 512
Conv dw/s2  | 3 x 3 x 512 dw      | 14 x 14 x 512
Conv/s1     | 1 x 1 x 512 x 1024  | 7 x 7 x 512
Conv dw/s2  | 3 x 3 x 1024 dw     | 7 x 7 x 1024
Conv/s1     | 1 x 1 x 1024 x 1024 | 7 x 7 x 1024
Avg Pool/s1 | Pool 7 x 7          | 7 x 7 x 1024
FC/s1       | 1024 x 1000         | 1 x 1 x 1024
Softmax/s1  | Classifier          | 1 x 1 x 1000

In the above table, for the layers marked Conv, the fourth value in the 'Filter shape' column is the number of filters of that convolution layer. For the layers marked Conv dw, the third value is the number of channels, each of which is filtered by its own 3 x 3 kernel.
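As a quick consistency check on the table, the sketch below walks the 13 depthwise separable blocks and tracks the feature map shape after each one. It assumes "same" padding, so the spatial size after a layer is ceil(size / stride), and it uses strides that agree with the input sizes listed in the table. In particular, it assumes stride 1 for the last depthwise layer, because the table keeps the spatial size at 7 x 7 across it even though it is labelled s2. The names `BLOCKS` and `mobilenet_v1_shapes` are made up for this example.

```python
import math

# One entry per depthwise separable block:
# (stride of the depthwise conv, output channels of the pointwise conv).
# Assumption: the last block uses stride 1 (spatial size stays 7 x 7).
BLOCKS = [
    (1, 64), (2, 128), (1, 128), (2, 256), (1, 256), (2, 512),
    (1, 512), (1, 512), (1, 512), (1, 512), (1, 512),
    (2, 1024), (1, 1024),
]


def mobilenet_v1_shapes(size=224):
    """Return (height, width, channels) after the first conv and after each block."""
    size = math.ceil(size / 2)           # initial 3 x 3 conv with stride 2
    shapes = [(size, size, 32)]
    for stride, out_ch in BLOCKS:
        size = math.ceil(size / stride)  # "same" padding assumed
        shapes.append((size, size, out_ch))
    return shapes


shapes = mobilenet_v1_shapes()
print(len(BLOCKS), shapes[-1])  # 13 (7, 7, 1024)
```

The final 7 x 7 x 1024 tensor is what the average pooling layer in the table receives.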

Single Shot Multibox Detector

Single Shot Multibox Detector (SSD) is an algorithm that detects multiple objects in an image in a single forward pass ("one shot") of a single deep neural network, using multibox. It works on feature maps at several different scales, so it can detect objects of various sizes in the image. Given below is the architecture of SSD:

SSD-architecture

Generally, SSD uses a separate network for feature extraction, called the base network. In the above image, the algorithm uses VGG to extract feature maps. The last few layers of VGG, such as the max-pool, FC and softmax layers, are omitted, and the output of the truncated VGG is used as the feature maps on which detections are based.

More convolution layers are then added and their intermediate tensors are kept, so that a stack of feature maps of various sizes is generated for detection. Assume a feature layer of size a x b with c channels. A convolution (mostly 3 x 3) is applied to this a x b x c feature layer. For each location in the feature layer, k bounding boxes are predicted, each with a probability score assigned to it.
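The arithmetic behind this can be sketched as follows: an a x b feature layer with k boxes per location yields a * b * k boxes, and each box carries one score per class plus 4 box offsets. The feature map sizes and per-location box counts below are assumptions taken from the SSD300 configuration of the original SSD paper, with 21 classes (20 object classes plus background); other configurations will give different totals.

```python
def ssd_outputs(feature_maps, num_classes):
    """Count default boxes and raw predicted values over all feature maps.

    feature_maps: list of (height, width, boxes_per_location).
    Each box gets num_classes scores plus 4 box offsets.
    """
    boxes = sum(h * w * k for h, w, k in feature_maps)
    return boxes, boxes * (num_classes + 4)


# Assumed SSD300 configuration: (a, b, k) per feature layer.
SSD300_MAPS = [(38, 38, 4), (19, 19, 6), (10, 10, 6),
               (5, 5, 6), (3, 3, 4), (1, 1, 4)]

boxes, values = ssd_outputs(SSD300_MAPS, num_classes=21)
print(boxes, values)  # 8732 218300
```

Most of the boxes come from the largest (38 x 38) feature map, which is the one responsible for small objects.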

Finally, non-max suppression is used to make sure that there is only one bounding box around each object. It is achieved as follows:

First, all the bounding boxes whose probability is less than a certain threshold (say 0.6) are discarded. Then, of the remaining boxes, the box with the greatest probability is selected for each object, and all the other boxes around that object are suppressed. This leaves a single bounding box around each identified object.

Since all the boxes with non-maximum values are suppressed, the method is called non-maxima suppression.
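A minimal sketch of this procedure is given below. The text does not say how boxes are matched to the same object, so the sketch assumes the common criterion of intersection over union (IoU): a box is suppressed if it overlaps an already kept, higher-scoring box by more than `iou_threshold`. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the function names and threshold defaults are illustrative. In practice, detectors usually run this once per class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    # Step 1: discard boxes below the score threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: visit the rest from highest to lowest score.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (11, 11, 49, 49)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

Here box 3 is dropped by the score threshold, and box 1 is suppressed because it overlaps the higher-scoring box 0; box 2 survives because it covers a different region.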

SSD MobileNet V1 architecture

There are practical limitations to deploying and running complex, power-hungry neural networks in real-time applications on low-cost hardware. Since SSD is independent of its base network, MobileNet was used as the base network of SSD to tackle this problem.

This is known as MobileNet SSD.

When MobileNet V1 is used along with SSD, the last few layers, such as the average pooling, FC and softmax layers, are omitted. The output of the final convolution layer of MobileNet is used and convolved a few more times to obtain a stack of feature maps. These are then used as inputs for the detection heads. The architecture can be modified as required. The table below gives one such architecture in detail.

Type/Stride | Filter shape        | Input size
Conv/s2     | 3 x 3 x 3 x 32      | 300 x 300 x 3
Conv dw/s1  | 3 x 3 x 32 dw       | 150 x 150 x 32
Conv/s1     | 1 x 1 x 32 x 64     | 150 x 150 x 32
Conv dw/s2  | 3 x 3 x 64 dw       | 150 x 150 x 64
Conv/s1     | 1 x 1 x 64 x 128    | 75 x 75 x 64
Conv dw/s1  | 3 x 3 x 128 dw      | 75 x 75 x 128
Conv/s1     | 1 x 1 x 128 x 128   | 75 x 75 x 128
Conv dw/s2  | 3 x 3 x 128 dw      | 75 x 75 x 128
Conv/s1     | 1 x 1 x 128 x 256   | 38 x 38 x 128
Conv dw/s1  | 3 x 3 x 256 dw      | 38 x 38 x 256
Conv/s1     | 1 x 1 x 256 x 512   | 38 x 38 x 256
Conv dw/s1  | 3 x 3 x 512 dw      | 38 x 38 x 512
Conv/s1     | 1 x 1 x 512 x 512   | 38 x 38 x 512
Conv dw/s1  | 3 x 3 x 512 dw      | 38 x 38 x 512
Conv/s1     | 1 x 1 x 512 x 512   | 38 x 38 x 512
Conv dw/s1  | 3 x 3 x 512 dw      | 38 x 38 x 512
Conv/s1     | 1 x 1 x 512 x 512   | 38 x 38 x 512
Conv dw/s1  | 3 x 3 x 512 dw      | 38 x 38 x 512
Conv/s1     | 1 x 1 x 512 x 512   | 38 x 38 x 512
Conv dw/s1  | 3 x 3 x 512 dw      | 38 x 38 x 512
Conv/s1     | 1 x 1 x 512 x 512   | 38 x 38 x 512
Conv dw/s1  | 3 x 3 x 512 dw      | 38 x 38 x 512
Conv/s1     | 1 x 1 x 512 x 512   | 38 x 38 x 512
Conv/s2     | 3 x 3 x 512 x 1024  | 38 x 38 x 512
Conv/s1     | 1 x 1 x 1024 x 1024 | 19 x 19 x 1024
Conv/s1     | 1 x 1 x 1024 x 256  | 19 x 19 x 1024
Conv/s2     | 3 x 3 x 256 x 512   | 19 x 19 x 256
Conv/s1     | 1 x 1 x 512 x 128   | 10 x 10 x 512
Conv/s2     | 3 x 3 x 128 x 256   | 10 x 10 x 128
Conv/s1     | 1 x 1 x 256 x 128   | 5 x 5 x 256
Conv/s2     | 3 x 3 x 128 x 256   | 5 x 5 x 128
Conv/s1     | 1 x 1 x 256 x 128   | 3 x 3 x 256
Conv/s1     | 3 x 3 x 128 x 256   | 3 x 3 x 128
Conv/s1     | 1 x 1 x 256 x 128   | 1 x 1 x 256
Conv/s1     | 3 x 3 x 128 x 256   | 1 x 1 x 128
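The shrinking spatial sizes in the last rows of the table follow from the usual convolution output-size formula, floor((n + 2p - k) / s) + 1. The sketch below applies it to the 3 x 3 layers that come after the 38 x 38 stage; the 1 x 1 layers in between leave the spatial size unchanged. The table does not state padding, so the padding values here are assumptions chosen to reproduce the listed sizes.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) of the 3 x 3 layers after the 38 x 38 stage.
# Padding is an assumption: 1 for the stride-2 layers, 0 for the 3 -> 1 step.
EXTRA_3X3 = [(3, 2, 1), (3, 2, 1), (3, 2, 1), (3, 2, 1), (3, 1, 0)]

sizes = [38]
for kernel, stride, padding in EXTRA_3X3:
    sizes.append(conv_out(sizes[-1], kernel, stride, padding))
print(sizes)  # [38, 19, 10, 5, 3, 1]
```

These six sizes form the stack of feature maps of different scales on which the detection heads operate.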

Given below is a pictorial representation of MobileNet V1 based SSD architecture pattern.

Mobilenet-v1-based-modified-SSD-network-architecture

With this article at OpenGenus, you should now have a clear idea of the SSD MobileNet architecture.