# SSD object detection: Single Shot MultiBox Detector for real-time processing

SSD is designed for real-time object detection. Faster R-CNN uses a region proposal network to create boundary boxes and utilizes those boxes to classify objects. While it is considered the state-of-the-art in accuracy, the whole process runs at only 7 frames per second, far below what real-time processing needs. SSD speeds up the process by eliminating the need for the region proposal network. To recover the drop in accuracy, SSD applies a few improvements including multi-scale features and default boxes. These improvements allow SSD to match Faster R-CNN’s accuracy using **lower resolution images**, which further pushes the speed higher. According to the following comparison, it achieves real-time processing speed and even beats the accuracy of Faster R-CNN. (Accuracy is measured as the mean average precision, mAP: the precision of the predictions.)

**SSD**

The SSD object detector is composed of 2 parts:

- Extract feature maps, and
- Apply convolution filters to detect objects.

SSD uses **VGG16** to extract feature maps. Then it detects objects using the **Conv4_3** layer. For illustration, we draw the Conv4_3 to be 8 × 8 spatially (it should be 38 × 38). For each cell (also called location), it makes 4 object predictions.

Each prediction is composed of a boundary box and 21 class scores (20 object classes plus one extra class for no object), and we pick the highest score as the class for the bounded object. Conv4_3 makes a total of 38 × 38 × 4 predictions: four predictions per cell regardless of the depth of the feature maps. As expected, many predictions contain no object, so SSD reserves the class “0” to indicate that a prediction has no object.

Making multiple predictions containing boundary boxes and confidence scores is called multibox.

**Convolutional predictors for object detection**

SSD does not use a delegated region proposal network. Instead, it resorts to a very simple method: it computes both the location and class scores using **small convolution filters**. After extracting the feature maps, SSD applies 3 × 3 convolution filters at each cell to make predictions. (These filters compute the results just like regular CNN filters.) Each filter outputs 25 channels: 21 class scores plus 4 values for one boundary box (more detail on the boundary box later).

For example, in Conv4_3, we apply four 3 × 3 filters to map 512 input channels to 25 output channels.
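As a quick sanity check, the channel and prediction counts for the Conv4_3 head work out as follows (a plain-Python sketch of the arithmetic, not the original implementation):

```python
# Shape arithmetic for the Conv4_3 prediction head.
in_channels, spatial = 512, 38        # Conv4_3 feature map: 38 x 38 x 512
num_defaults = 4                      # predictions per cell
num_classes, box_params = 21, 4       # 20 object classes + "no object"; cx, cy, w, h
out_per_default = num_classes + box_params      # 25 output channels per default box

out_channels = num_defaults * out_per_default   # 100 channels from the four 3x3 filters
predictions = spatial * spatial * num_defaults  # 5776 predictions from Conv4_3 alone

print(out_channels, predictions)  # 100 5776
```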

**Multi-scale feature maps for detection**

So far, we have described how SSD detects objects from a single layer. In fact, it uses multiple layers (**multi-scale feature maps**) to detect objects independently. As a CNN reduces the spatial dimension gradually, the resolution of the feature maps also decreases. SSD uses lower resolution layers to detect larger scale objects. For example, the 4 × 4 feature maps are used for larger scale objects.

SSD adds 6 more auxiliary convolution layers after VGG16. Five of them are used for object detection. In three of those layers, we make 6 predictions per cell instead of 4. In total, SSD makes 8732 predictions using 6 layers.
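The 8732 figure can be verified from the per-layer grid sizes and default boxes per cell (layer names as in the SSD300 paper):

```python
# Predictions per layer for SSD300: (feature-map size, default boxes per cell).
# Layers: conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2.
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

total = sum(size * size * boxes for size, boxes in layers)
print(total)  # 8732
```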

Multi-scale feature maps improve accuracy significantly. Here is the accuracy with different numbers of feature map layers used for object detection.

**Default boundary box**

The default boundary boxes are equivalent to anchors in Faster R-CNN.

How do we predict boundary boxes? Just like in other deep learning tasks, we could start with random predictions and use gradient descent to optimize the model. However, during the initial training, the predictions may fight with each other to determine which shapes (pedestrians or cars) should be optimized for which predictions. Empirical results indicate that early training can be very unstable. A boundary box prediction may work well for one category but not for others. We want our initial predictions to be diverse, not looking similar.

If our predictions cover more shapes, like the one below, our model can detect more object types. This kind of head start makes training much easier and more stable.

In real-life, **boundary boxes do not have arbitrary shapes** and sizes. Cars have similar shapes and pedestrians have an approximate aspect ratio of 0.41. In the KITTI dataset used in autonomous driving, the width and height distributions for the boundary boxes are highly clustered.

Conceptually, the ground truth boundary boxes can be partitioned into clusters with each cluster represented by a **default boundary box** (the centroid of the cluster). So, instead of making random guesses, we can start the guesses based on those default boxes.

To keep the complexity low, the default boxes are pre-selected manually and carefully to cover a wide spectrum of real-life objects. SSD also keeps the default boxes to a minimum (4 or 6) with one prediction per default box. Now, instead of using global coordinates for the box location, the boundary box predictions are relative to the default boundary boxes at each cell (∆cx, ∆cy, ∆w, ∆h), i.e. the offsets (differences) from the default box at each cell for its center (*cx*, *cy*), the width, and the height.
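As a minimal sketch of the idea, a predicted offset is applied to its default box as plain differences, as described here (the actual SSD parameterization scales the center offsets by the box size and uses log-space width and height, but the principle is the same):

```python
# Turn a predicted offset (dcx, dcy, dw, dh) into an absolute box by
# applying it to the cell's default box. Plain differences are used here
# for illustration; the real SSD encoding is a scaled/log variant.
def decode(default_box, offsets):
    cx, cy, w, h = default_box
    dcx, dcy, dw, dh = offsets
    return (cx + dcx, cy + dcy, w + dw, h + dh)

# A default box centered at (0.5, 0.5), nudged by the predicted offsets:
box = decode((0.5, 0.5, 0.2, 0.4), (0.02, -0.01, 0.05, 0.0))
```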

Each feature map layer shares the same set of default boxes centered at the corresponding cells. But different layers use different sets of default boxes to customize object detection at different resolutions. The 4 green boxes below illustrate 4 default boundary boxes.

# Choosing default boundary boxes

Default boundary boxes are chosen manually. SSD defines a scale value for each feature map layer. Starting from the left, Conv4_3 detects objects at the smallest scale 0.2 (or 0.1 sometimes), and the scale then increases linearly to 0.9 at the rightmost layer. Combining the scale value with the target aspect ratios, we compute the width and the height of the default boxes. For layers making 6 predictions, SSD starts with 5 target aspect ratios: 1, 2, 3, 1/2, and 1/3. Then the width and the height of the default boxes are calculated as:

w = scale × √(aspect ratio)
h = scale / √(aspect ratio)

For those layers, SSD adds one extra default box with scale √(sₖ × sₖ₊₁) and aspect ratio = 1.
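This recipe (w = s·√ar, h = s/√ar per aspect ratio, plus one extra square box at scale √(sₖ·sₖ₊₁)) can be sketched in a few lines; the scale values below are illustrative, not the exact paper configuration:

```python
import math

# Default-box (width, height) pairs for one 6-prediction layer:
# one box per aspect ratio, plus one extra square box at the
# geometric mean of this layer's scale and the next layer's scale.
def default_boxes(scale, next_scale, ratios=(1, 2, 3, 1/2, 1/3)):
    boxes = [(scale * math.sqrt(ar), scale / math.sqrt(ar)) for ar in ratios]
    boxes.append((math.sqrt(scale * next_scale),) * 2)  # extra square box
    return boxes

boxes = default_boxes(scale=0.2, next_scale=0.34)  # illustrative scales
print(len(boxes))  # 6 default boxes for this layer
```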

YOLO uses k-means clustering on the training dataset to determine those default boundary boxes.

**Matching strategy**

SSD predictions are classified as **positive** matches or **negative** matches. SSD only uses positive matches in calculating the **localization cost** (the mismatch of the boundary box). If the corresponding **default boundary box** (not the predicted boundary box) has an IoU greater than 0.5 with the ground truth, the match is positive; otherwise, it is negative. (**IoU**, the intersection over union, is the ratio of the intersected area to the joined area of two regions.)

Let’s simplify our discussion to 3 default boxes. Only default box 1 and 2 (but not 3) have an IoU greater than 0.5 with the ground truth box above (blue box). So only box 1 and 2 are positive matches. Once we identify the positive matches, we use the corresponding predicted boundary boxes to calculate the cost. This matching strategy nicely partitions which shapes of ground truth each prediction is responsible for.
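The IoU computation and the >0.5 matching rule can be sketched as follows (a plain-Python illustration with made-up boxes in (x1, y1, x2, y2) form):

```python
# Intersection over union for two axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0.0, 0.0, 1.0, 1.0)
defaults = [(0.0, 0.0, 1.0, 0.9),   # heavy overlap  -> positive match
            (0.1, 0.1, 0.9, 1.0),   # heavy overlap  -> positive match
            (0.8, 0.8, 1.5, 1.5)]   # little overlap -> negative match
positives = [d for d in defaults if iou(ground_truth, d) > 0.5]
print(len(positives))  # 2 positive matches
```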

This matching strategy encourages each prediction to predict shapes closer to its corresponding default box. Therefore, our predictions are more diverse, and training is more stable.

**Multi-scale feature maps & default boundary boxes**

Here is an example of how SSD combines multi-scale feature maps and default boundary boxes to detect objects at different scales and aspect ratios. The dog below matches one default box (in red) in the 4 × 4 feature map layer, but not any default boxes in the higher resolution 8 × 8 feature map. The cat which is smaller is detected only by the 8 × 8 feature map layer in 2 default boxes (in blue).

Higher-resolution feature maps are responsible for detecting small objects. The first layer for object detection, *conv4_3*, has a spatial dimension of 38 × 38, a pretty large reduction from the input image. Hence, SSD usually performs badly for small objects compared with other detection methods. If that is a problem, we can mitigate it by using input images with higher resolution.

**Loss function**

The **localization loss** is the mismatch between the ground truth box and the predicted boundary box. SSD only penalizes predictions from positive matches. We want the predictions from the positive matches to get closer to the ground truth. Negative matches can be ignored.

The **confidence loss** is the loss of making a class prediction. For every positive match prediction, we penalize the loss according to the confidence score of the corresponding class. For negative match predictions, we penalize the loss according to the confidence score of the class “0”: class “0” indicates that no object is detected.

The final loss function is computed as:

L = (1/N) × (L_conf + α × L_loc)

where N is the number of positive matches and α is the weight for the localization loss.
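At the scalar level, the combination works out as below (a minimal sketch: it takes already-summed localization and confidence losses as inputs rather than computing them from raw predictions):

```python
# Combined SSD loss: confidence loss plus weighted localization loss,
# normalized by the number of positive matches N.
def ssd_loss(loc_loss_sum, conf_loss_sum, num_positives, alpha=1.0):
    if num_positives == 0:   # no matched boxes: the paper sets the loss to 0
        return 0.0
    return (conf_loss_sum + alpha * loc_loss_sum) / num_positives

print(ssd_loss(loc_loss_sum=2.0, conf_loss_sum=4.0, num_positives=2))  # 3.0
```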

**Hard negative mining**

We make far more predictions than there are objects present, so there are many more negative matches than positive matches. This creates a class imbalance that hurts training: we would be training the model to learn background space rather than to detect objects. However, SSD still needs negative samples to learn what constitutes a bad prediction. So, instead of using all the negatives, SSD sorts the negatives by their calculated confidence loss, picks the ones with the top loss, and makes sure the ratio between the picked negatives and the positives is at most 3:1. This leads to faster and more stable training.
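The selection step can be sketched in a few lines (plain Python, with made-up loss values):

```python
# Hard negative mining: keep only the highest-loss negatives,
# at most `ratio` negatives per positive match.
def mine_negatives(negative_losses, num_positives, ratio=3):
    keep = min(len(negative_losses), ratio * num_positives)
    return sorted(negative_losses, reverse=True)[:keep]

losses = [0.1, 2.3, 0.05, 1.7, 0.4, 0.9, 0.2, 3.1]
kept = mine_negatives(losses, num_positives=2)
print(kept)  # the 6 hardest negatives: [3.1, 2.3, 1.7, 0.9, 0.4, 0.2]
```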

**Data augmentation**

Data augmentation is **important** for improving accuracy. We augment the data with flipping, cropping, and color distortion. To handle variance in object sizes and shapes, each training image is randomly sampled with one of the following options:

- Use the original,
- Sample a patch with a minimum IoU of 0.1, 0.3, 0.5, 0.7 or 0.9 with the objects,
- Randomly sample a patch.

The sampled patch has an aspect ratio between 1/2 and 2. It is then resized to a fixed size, and we flip one half of the training data horizontally. In addition, we can apply photometric distortions.
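The sampling recipe above can be roughed out as follows (a simplified sketch: the real SSD sampler also enforces the min-IoU constraint against the ground-truth boxes, which is omitted here):

```python
import random

# Pick one of the three per-image sampling options described above.
def sample_training_option(rng=random):
    return rng.choice(["original",
                       "min_iou_patch",   # min IoU of 0.1, 0.3, 0.5, 0.7 or 0.9
                       "random_patch"])

# Sampled patches keep an aspect ratio between 1/2 and 2.
def sample_patch_aspect_ratio(rng=random):
    return rng.uniform(0.5, 2.0)

option = sample_training_option()
ratio = sample_patch_aspect_ratio()
```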

Here is the performance improvement after data augmentation:

# Inference time

SSD makes many predictions (8732) for better coverage of location, scale, and aspect ratios, more than many other detection methods. However, many predictions contain no object. Therefore, any predictions with class confidence scores lower than 0.01 will be eliminated.

**Non-maximum suppression (NMS)**

SSD uses **non-maximum suppression** to remove duplicate predictions pointing to the same object. SSD sorts the predictions by their confidence scores. Starting from the top confidence prediction, SSD evaluates whether any previously kept boundary boxes have an IoU higher than 0.45 with the current prediction for the same class. If found, the current prediction is ignored. At most, we keep the top 200 predictions per image.
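A greedy NMS pass for a single class can be sketched like this (plain-Python illustration with made-up boxes; in SSD it runs per class):

```python
# Greedy non-maximum suppression: sort by score, then drop any box that
# overlaps an already-kept box by more than the IoU threshold.
def nms(boxes, scores, iou_threshold=0.45, top_k=200):
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
        if len(kept) == top_k:
            break
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 is suppressed by box 0
```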

# Result

The model is trained using SGD with an initial learning rate of 0.001, 0.9 momentum, 0.0005 weight decay, and batch size 32. Using an Nvidia Titan X on the VOC2007 test set, SSD achieves 59 FPS with mAP 74.3%, vs. Faster R-CNN at 7 FPS with mAP 73.2% and YOLO at 45 FPS with mAP 63.4%.

Here is the accuracy comparison for different methods. SSD uses an input image size of 300 × 300 or 512 × 512.

Here is a recap of the speed performance in frames per second.

# Findings

Here are some key observations:

- SSD performs worse than Faster R-CNN for small-scale objects. In SSD, small objects can only be detected in higher resolution layers (leftmost layers). But those layers contain low-level features, like edges or color patches, that are less informative for classification.
- Accuracy increases with the number of default boundary boxes, at the cost of speed.
- Multi-scale feature maps improve the detection of objects at different scales.
- Designing better default boundary boxes will help accuracy.
- The COCO dataset has smaller objects. To improve accuracy, use smaller default boxes (starting with a smaller scale of 0.15).
- SSD has lower localization error compared with R-CNN but more classification error when dealing with similar categories. The higher classification error is likely because we use the same boundary box to make multiple class predictions.
- SSD512 has better accuracy (2.5% higher mAP) than SSD300 but runs at 22 FPS instead of 59.

**Conclusion**

SSD is a single-shot detector. It has no delegated region proposal network and predicts the boundary boxes and the classes directly from feature maps in one single pass.

To improve accuracy, SSD introduces:

- small convolutional filters to predict object classes and offsets to default boundary boxes.
- separate filters for default boxes to handle the difference in aspect ratios.
- multi-scale feature maps for object detection.

SSD can be trained end-to-end for better accuracy. SSD makes more predictions and has better coverage of location, scale, and aspect ratios. With the improvements above, SSD can lower the input image resolution to 300 × 300 with comparable accuracy. By removing the delegated region proposal network and using lower resolution images, the model can run at real-time speed and still beat the accuracy of the state-of-the-art Faster R-CNN.