How Does the Single-Shot Detector (SSD) Work?

Last Updated : 23 Jul, 2025

Object detection is a critical task in computer vision, with applications ranging from autonomous driving to image retrieval and surveillance. The Single Shot Detector (SSD) is a detection algorithm that locates and classifies objects in a single forward pass of a network, which makes real-time detection practical. This article covers how SSD works, its architecture, its key advantages, and its practical applications.

Introduction to Single-shot Detector

Object detection involves identifying and locating objects within an image. Earlier region-based detectors, such as Faster R-CNN, first propose candidate regions and then classify each one, which makes them computationally expensive and slow. SSD simplifies this process by predicting classes and boxes for the whole image in a single pass, hence the name "Single Shot Detector." This approach speeds up detection while keeping accuracy competitive, making SSD a popular choice for real-time applications.

Model Architecture

Base Network

The SSD architecture begins with a pre-trained convolutional neural network (CNN) known as the base network. Commonly, networks like VGG16 are used due to their strong feature extraction capabilities. The base network processes the input image and generates feature maps, which are essential for object detection.

Extra Layers

Beyond the base network, SSD includes extra convolutional layers. The feature maps these layers produce progressively decrease in spatial size, so each one is suited to detecting objects at a different scale. Each additional layer generates a feature map that contributes to the final detection process.

Feature Maps and Multi-scale Detection

A standout feature of SSD is its use of multi-scale feature maps. These maps capture information at various resolutions, allowing SSD to detect objects of different sizes effectively. Higher resolution feature maps are adept at detecting smaller objects, while lower resolution maps handle larger objects.

Default Boxes (Anchor Boxes)

SSD employs a technique called default boxes (also known as anchor boxes) at each location in the feature maps. These boxes come in various aspect ratios and scales, providing a diverse set of potential object locations. Each default box is associated with two sets of predictions:

  • Class Scores: These scores indicate the likelihood of an object belonging to a specific class.
  • Bounding Box Offsets: These offsets refine the default box to better match the actual object's location.
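
As a rough illustration of how default boxes tile a feature map, the sketch below generates boxes in center form (cx, cy, w, h) with coordinates normalized to [0, 1]. The function name make_default_boxes and the example scale and aspect ratios are assumptions chosen for illustration, not values from this article's model. The width and height formulas (w = s * sqrt(a), h = s / sqrt(a)) follow the convention used in the original SSD paper.

```python
import math
from itertools import product


def make_default_boxes(fmap_size, scale, aspect_ratios):
    """Return default boxes as (cx, cy, w, h) tuples in normalized coordinates.

    fmap_size, scale, and aspect_ratios are illustrative inputs, not values
    taken from this article's model.
    """
    boxes = []
    for row, col in product(range(fmap_size), repeat=2):
        # Center of the feature map cell, normalized to [0, 1]
        cx = (col + 0.5) / fmap_size
        cy = (row + 0.5) / fmap_size
        for ar in aspect_ratios:
            # Every aspect ratio keeps the same area (scale ** 2)
            w = scale * math.sqrt(ar)
            h = scale / math.sqrt(ar)
            boxes.append((cx, cy, w, h))
    return boxes


boxes = make_default_boxes(fmap_size=3, scale=0.5, aspect_ratios=[1.0, 2.0, 0.5])
print(len(boxes))  # 3 * 3 cells * 3 aspect ratios = 27
print(boxes[0])    # first box of the top-left cell
```

Each cell contributes one box per aspect ratio, so the number of boxes grows with both the feature map resolution and the number of aspect ratios.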

Predictions

For each default box, SSD predicts:

  • Class Confidences: The probability of the box containing a specific object class.
  • Bounding Box Adjustments: The coordinates to refine the position and size of the default box to match the detected object more precisely.
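
To make the role of the bounding box adjustments concrete, here is a minimal sketch of the offset encoding described in the original SSD paper: center offsets are expressed relative to the default box size, and width and height use a log scale. The function names encode and decode and the example boxes are hypothetical. Many implementations additionally scale the offsets by fixed "variance" constants, which this sketch leaves out.

```python
import math


def encode(gt, default):
    """Offsets that map a default box onto a ground-truth box (center form)."""
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return (
        (g_cx - d_cx) / d_w,
        (g_cy - d_cy) / d_h,
        math.log(g_w / d_w),
        math.log(g_h / d_h),
    )


def decode(offsets, default):
    """Apply predicted offsets to a default box to recover the final box."""
    t_x, t_y, t_w, t_h = offsets
    d_cx, d_cy, d_w, d_h = default
    return (
        d_cx + t_x * d_w,
        d_cy + t_y * d_h,
        d_w * math.exp(t_w),
        d_h * math.exp(t_h),
    )


default_box = (0.5, 0.5, 0.2, 0.4)     # (cx, cy, w, h), illustrative values
ground_truth = (0.55, 0.45, 0.3, 0.3)  # (cx, cy, w, h), illustrative values

offsets = encode(ground_truth, default_box)
recovered = decode(offsets, default_box)
print(offsets)
print(recovered)  # matches ground_truth up to floating-point error
```

During training the network learns to output the encoded offsets; at inference time the decode step turns those outputs back into box coordinates.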

Loss Function

The SSD loss function combines two components:

  • Localization Loss (L_loc): This measures how closely the predicted bounding boxes match the ground truth boxes, using Smooth L1 loss.
  • Confidence Loss (L_conf): This measures how well the predicted class scores match the true class labels, using softmax (cross-entropy) loss.
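
The two components can be sketched in plain Python as follows. The combination used here, (confidence loss + alpha * localization loss) divided by the number of matched boxes, follows the objective in the original SSD paper. The function names are hypothetical, and the sketch assumes that every box passed in has already been matched to a ground truth object; the hard negative mining that the paper applies to background boxes is deliberately omitted.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def ssd_loss(pred_offsets, true_offsets, logits, labels, alpha=1.0):
    """Combine both terms, normalized by the number of matched boxes.

    Assumes all inputs belong to matched (positive) boxes; hard negative
    mining is not part of this sketch.
    """
    n = len(labels)
    if n == 0:
        return 0.0  # no matched boxes: the loss is defined as zero
    l_loc = sum(
        smooth_l1(p - t)
        for pred, true in zip(pred_offsets, true_offsets)
        for p, t in zip(pred, true)
    )
    l_conf = sum(softmax_cross_entropy(z, y) for z, y in zip(logits, labels))
    return (l_conf + alpha * l_loc) / n


# One matched box with illustrative offsets and class scores
loss = ssd_loss(
    pred_offsets=[(0.1, -0.2, 0.0, 0.3)],
    true_offsets=[(0.0, 0.0, 0.0, 0.0)],
    logits=[[2.0, 0.5, -1.0]],
    labels=[0],
)
print(loss)
```

Smooth L1 behaves like a squared error for small residuals and like an absolute error for large ones, which keeps badly placed boxes from dominating the gradient.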

Non-Maximum Suppression (NMS)

To finalize the detection process, SSD applies Non-Maximum Suppression (NMS). This step eliminates redundant boxes with lower confidence scores, ensuring that only the most confident and relevant predictions are retained.
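
A minimal sketch of greedy NMS in plain Python is shown below, assuming corner-form boxes (x1, y1, x2, y2) and an IoU threshold of 0.5. The helper names iou and nms, the threshold, and the example boxes are illustrative assumptions. In practice NMS is usually run separately for each class, and libraries such as torchvision provide an optimized version (torchvision.ops.nms).

```python
def iou(a, b):
    """Intersection over union of two corner-form boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [
    (10, 10, 50, 50),      # highest score
    (12, 12, 52, 52),      # near-duplicate of the first box
    (100, 100, 150, 150),  # a separate object
]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap either and is kept.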

Steps in Single Shot Detection

  1. Input Image: The image is passed through the base network to extract feature maps.
  2. Feature Extraction: The extra layers process these maps at multiple scales.
  3. Default Boxes Assignment: Default boxes of various sizes and aspect ratios are assigned to each feature map cell.
  4. Prediction: For each default box, class scores and bounding box offsets are predicted.
  5. Loss Calculation: The loss is computed based on localization and confidence.
  6. NMS: Redundant boxes are removed to produce the final set of detections.

Implementation of Single-Shot Detection

Here is a step-by-step implementation of the Single Shot Detector (SSD) with explanations and code snippets for each step.

Step 1: Import Required Libraries

In this step, we import the necessary libraries for building the SSD model.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models
from torchvision.models import VGG16_Weights

Step 2: Define the SSD Model Class

We define the SSD model class, which includes the base network (VGG16), additional layers for SSD, localization, and confidence layers.

Explanation:

  • Base Network: Uses a pre-trained VGG16 model up to the conv5_3 layer.
  • Extra Layers: Additional convolutional layers to detect objects at multiple scales.
  • Localization and Confidence Layers: These layers predict the bounding boxes and class scores.

class SSD(nn.Module):
    def __init__(self, num_classes):
        super(SSD, self).__init__()
        self.num_classes = num_classes

        # Load the pre-trained VGG16 model
        vgg = models.vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
        self.features = nn.ModuleList(vgg[:30])  # Use up to the conv5_3 layer

        # Additional layers for SSD
        self.extras = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(512, 1024, kernel_size=3, padding=1, dilation=1),
                nn.ReLU(inplace=True)
            ),
            nn.Sequential(
                nn.Conv2d(1024, 256, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True)
            ),
            nn.Sequential(
                nn.Conv2d(512, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True)
            ),
            nn.Sequential(
                nn.Conv2d(256, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3),
                nn.ReLU(inplace=True)
            ),
            nn.Sequential(
                nn.Conv2d(256, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3),
                nn.ReLU(inplace=True)
            )
        ])

        # Localization and class prediction layers
        self.loc = nn.ModuleList([
            nn.Conv2d(512, 4 * 4, kernel_size=3, padding=1),   # 4 default boxes
            nn.Conv2d(1024, 6 * 4, kernel_size=3, padding=1),  # 6 default boxes
            nn.Conv2d(512, 6 * 4, kernel_size=3, padding=1),   # 6 default boxes
            nn.Conv2d(256, 6 * 4, kernel_size=3, padding=1),   # 6 default boxes
            nn.Conv2d(256, 4 * 4, kernel_size=3, padding=1),   # 4 default boxes
            nn.Conv2d(256, 4 * 4, kernel_size=3, padding=1)    # 4 default boxes
        ])

        self.conf = nn.ModuleList([
            nn.Conv2d(512, 4 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(1024, 6 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(512, 6 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(256, 6 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(256, 4 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(256, 4 * num_classes, kernel_size=3, padding=1)
        ])

Step 3: Implement the Forward Pass

In this step, we define the forward pass method to process the input image through the network layers and generate the localization and confidence predictions.

Explanation:

  • Base Network: Process the input through the base network.
  • Extra Layers: Process the output from the base network through the extra layers.
  • Localization and Confidence Predictions: Apply localization and confidence layers to each feature map from the extra layers and base network.

    def forward(self, x):
        locs = []
        confs = []

        # Apply base network
        for k in range(len(self.features)):
            x = self.features[k](x)

        # Apply the first localization and confidence heads to the
        # base network output (conv5_3 in this simplified model)
        locs.append(self.loc[0](x).permute(0, 2, 3, 1).contiguous())
        confs.append(self.conf[0](x).permute(0, 2, 3, 1).contiguous())

        # Each extra layer yields one more feature map with its own heads
        for (i, layer) in enumerate(self.extras):
            x = layer(x)
            locs.append(self.loc[i+1](x).permute(0, 2, 3, 1).contiguous())
            confs.append(self.conf[i+1](x).permute(0, 2, 3, 1).contiguous())

        # Reshape and concatenate predictions
        locs = torch.cat([o.view(o.size(0), -1) for o in locs], 1)
        confs = torch.cat([o.view(o.size(0), -1) for o in confs], 1)

        locs = locs.view(locs.size(0), -1, 4)
        confs = confs.view(confs.size(0), -1, self.num_classes)

        return locs, confs

Step 4: Example Usage

Finally, we demonstrate how to create an instance of the SSD model and pass a sample input through it to obtain localization and confidence predictions.

Explanation:

  • Initialize SSD Model: Create an instance of the SSD model with the desired number of classes.
  • Sample Input: Generate a random sample input image.
  • Forward Pass: Pass the sample input through the SSD model to obtain predictions.

# Example usage
if __name__ == "__main__":
    num_classes = 21  # 20 classes + background
    ssd = SSD(num_classes)
    x = torch.randn(1, 3, 300, 300)
    locs, confs = ssd(x)
    print("Localization predictions:", locs.size())
    print("Confidence predictions:", confs.size())

Complete Implementation

Python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models
from torchvision.models import VGG16_Weights

class SSD(nn.Module):
    def __init__(self, num_classes):
        super(SSD, self).__init__()
        self.num_classes = num_classes

        # Load the pre-trained VGG16 model
        vgg = models.vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
        self.features = nn.ModuleList(vgg[:30])  # Use up to the conv5_3 layer

        # Additional layers for SSD
        self.extras = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(512, 1024, kernel_size=3, padding=1, dilation=1),
                nn.ReLU(inplace=True)
            ),
            nn.Sequential(
                nn.Conv2d(1024, 256, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True)
            ),
            nn.Sequential(
                nn.Conv2d(512, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True)
            ),
            nn.Sequential(
                nn.Conv2d(256, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3),
                nn.ReLU(inplace=True)
            ),
            nn.Sequential(
                nn.Conv2d(256, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3),
                nn.ReLU(inplace=True)
            )
        ])

        # Localization and class prediction layers
        self.loc = nn.ModuleList([
            nn.Conv2d(512, 4 * 4, kernel_size=3, padding=1),  # 4 default boxes
            nn.Conv2d(1024, 6 * 4, kernel_size=3, padding=1),  # 6 default boxes
            nn.Conv2d(512, 6 * 4, kernel_size=3, padding=1),  # 6 default boxes
            nn.Conv2d(256, 6 * 4, kernel_size=3, padding=1),  # 6 default boxes
            nn.Conv2d(256, 4 * 4, kernel_size=3, padding=1),  # 4 default boxes
            nn.Conv2d(256, 4 * 4, kernel_size=3, padding=1)   # 4 default boxes
        ])

        self.conf = nn.ModuleList([
            nn.Conv2d(512, 4 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(1024, 6 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(512, 6 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(256, 6 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(256, 4 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(256, 4 * num_classes, kernel_size=3, padding=1)
        ])

    def forward(self, x):
        locs = []
        confs = []

        # Apply base network
        for k in range(len(self.features)):
            x = self.features[k](x)
        
        # Apply the first localization and confidence heads to the
        # base network output (conv5_3 in this simplified model)
        locs.append(self.loc[0](x).permute(0, 2, 3, 1).contiguous())
        confs.append(self.conf[0](x).permute(0, 2, 3, 1).contiguous())

        for (i, layer) in enumerate(self.extras):
            x = layer(x)
            locs.append(self.loc[i+1](x).permute(0, 2, 3, 1).contiguous())
            confs.append(self.conf[i+1](x).permute(0, 2, 3, 1).contiguous())

        # Reshape and concatenate predictions
        locs = torch.cat([o.view(o.size(0), -1) for o in locs], 1)
        confs = torch.cat([o.view(o.size(0), -1) for o in confs], 1)

        locs = locs.view(locs.size(0), -1, 4)
        confs = confs.view(confs.size(0), -1, self.num_classes)

        return locs, confs

# Example usage
if __name__ == "__main__":
    num_classes = 21  # 20 classes + background
    ssd = SSD(num_classes)
    x = torch.randn(1, 3, 300, 300)
    locs, confs = ssd(x)
    print("Localization predictions:", locs.size())
    print("Confidence predictions:", confs.size())

Output:

Localization predictions: torch.Size([1, 3916, 4])
Confidence predictions: torch.Size([1, 3916, 21])
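
The figure of 3916 boxes can be checked by hand. The sketch below applies the standard output-size formula for convolution and pooling layers, floor((n + 2p - k) / s) + 1, to the kernel, stride, and padding values used in the model above, and then multiplies each feature map's cell count by its number of default boxes. The helper name conv_out is hypothetical. For comparison, the original SSD300 configuration reports 8732 boxes, largely because it also predicts from the higher-resolution conv4_3 feature map, which this simplified model does not use.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 300
for _ in range(4):  # four 2x2 max-pool layers precede conv5_3 in VGG16
    size = conv_out(size, kernel=2, stride=2)

sizes = [size]                                # base network output
sizes.append(conv_out(sizes[-1], 3, 1, 1))    # extras[0]: 3x3, padding 1
sizes.append(conv_out(sizes[-1], 3, 2, 1))    # extras[1]: 3x3, stride 2
sizes.append(conv_out(sizes[-1], 3, 2, 1))    # extras[2]: 3x3, stride 2
sizes.append(conv_out(sizes[-1], 3))          # extras[3]: 3x3, no padding
sizes.append(conv_out(sizes[-1], 3))          # extras[4]: 3x3, no padding

boxes_per_cell = [4, 6, 6, 6, 4, 4]  # from the loc/conf layer definitions
total = sum(s * s * b for s, b in zip(sizes, boxes_per_cell))
print(sizes)  # [18, 18, 9, 5, 3, 1]
print(total)  # 3916
```

The 3x3 convolutions with padding 1 in VGG16 and the 1x1 convolutions in the extra layers leave the spatial size unchanged, so only the pooling layers and the second convolution of each extra block appear in the calculation.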

Key Advantages of SSD

  1. Speed: One of the primary advantages of SSD is its speed. By eliminating the need for a region proposal network, SSD performs detection in a single shot, making it significantly faster than region-based algorithms like Faster R-CNN.
  2. Simplicity: SSD's straightforward architecture simplifies the detection process. The single-pass approach reduces complexity and makes the network easier to train and implement.
  3. Accuracy: SSD achieves competitive accuracy, especially for large objects, due to its multi-scale approach. By utilizing feature maps from multiple layers, SSD effectively detects objects of varying sizes.

Applications

The real-time capabilities of SSD make it suitable for a wide range of applications:

  • Autonomous Driving: Detecting vehicles, pedestrians, and traffic signs in real-time.
  • Surveillance: Monitoring and identifying objects or individuals in security footage.
  • Robotics: Enabling robots to perceive and interact with their environment.
  • Augmented Reality: Detecting and tracking objects for interactive experiences.

Interview Insight: How Does the Single-Shot Detector (SSD) Work?

The Single Shot Detector (SSD) is an object detection algorithm that identifies objects in images in a single forward pass of the network. It uses a pre-trained convolutional neural network (like VGG16) as a base to extract feature maps, and adds extra convolutional layers to handle objects at multiple scales. SSD employs default boxes of different aspect ratios and scales at each feature map location, predicting both class scores and bounding box offsets for these boxes. The combined loss function includes localization loss (for bounding box accuracy) and confidence loss (for class prediction accuracy). After generating predictions, Non-Maximum Suppression (NMS) is applied to eliminate redundant boxes and retain the most confident detections, enabling efficient and real-time object detection.

Conclusion

The Single Shot Detector (SSD) represents a significant advancement in object detection technology. Its ability to perform real-time detection with high accuracy and simplicity has made it a preferred choice for many applications. By leveraging multi-scale feature maps and default boxes, SSD efficiently detects objects in a single pass, offering a powerful tool for various computer vision tasks.
