Design choices, lessons learned and trends for object detections?

Detectors, whether region-based or single shot, start from different paths but now look much alike as they compete for the title of the fastest and most accurate detector. In fact, some of the performance difference may originate from subtle design and implementation choices rather than from the merits of the model. In Part 3, we cover some of those design choices, followed by some benchmarks done by Google Research. Then we conclude the series by summarizing how we got here and the lessons learned so far.

Part 1: What do we learn from region based object detectors (Faster R-CNN, R-FCN, FPN)?

Part 2: What do we learn from single shot object detectors (SSD, YOLO), FPN & Focal loss?

Part 3: Design choices, lessons learned and trends for object detections?

Box encoding and loss function

Detectors use different loss functions and box encoding methods. For example, YOLO predicts the square root of the width and height to normalize errors, so a 2-pixel difference counts as more significant for a small boundary box than for a large one. Here are the different loss functions and box encoding schemes used by different methods.


To train the model better, we apply different weights to different losses. For example, in YOLO, the weight for the localization loss is higher than the weight for classification so that objects are located more accurately.
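
As a rough illustration of both ideas, the sketch below compares the squared error on raw widths with the squared error on square-rooted widths, then combines a localization loss and a classification loss with different weights. The function names and sample numbers are hypothetical; the default localization weight of 5 is the value used in the YOLO paper, and real detectors add more loss terms than shown here.

```python
import math

def squared_error(pred, target):
    """Plain squared error on a box dimension."""
    return (pred - target) ** 2

def sqrt_squared_error(pred, target):
    """Squared error after taking the square root of each dimension."""
    return (math.sqrt(pred) - math.sqrt(target)) ** 2

# The same 2-pixel width error on a small box and on a large box.
small_raw = squared_error(12.0, 10.0)
large_raw = squared_error(202.0, 200.0)
small_sqrt = sqrt_squared_error(12.0, 10.0)
large_sqrt = sqrt_squared_error(202.0, 200.0)

def total_loss(loc_loss, cls_loss, loc_weight=5.0, cls_weight=1.0):
    """Weighted sum of losses; a higher loc_weight emphasizes localization."""
    return loc_weight * loc_loss + cls_weight * cls_loss

print(small_raw, large_raw)    # identical penalties without the square root
print(small_sqrt, large_sqrt)  # the small box is penalized more with it
```

With the square root, the 2-pixel error costs more on the small box than on the large one, which is the normalization effect described above.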

Feature extractors (VGG16, ResNet, Inception, MobileNet)

Feature extractors impact both accuracy and speed. ResNet and Inception are often used when accuracy is far more important than speed. MobileNet provides a lightweight extractor that works well with SSD, targeting real-time processing on mobile devices. For Faster R-CNN and R-FCN, the choice of feature extractor has more impact on accuracy than it does for SSD.

Non-max suppression (NMS)

In the benchmarked implementation, NMS runs only on the CPU and often takes up the bulk of the running time for single shot models.
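
For readers who have not seen it, here is a minimal pure-Python sketch of greedy non-max suppression. It assumes boxes are given as (x1, y1, x2, y2) tuples and uses a hypothetical IoU threshold of 0.5; production implementations are vectorized and usually run per class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Hypothetical predictions: the first two boxes overlap heavily.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The loop compares each candidate against every box already kept, so the cost grows with the number of predictions, which is one reason this step can dominate the running time of a fast model.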

Data augmentation

Augmenting data by cropping images helps the model learn to detect objects at different scales. At inference time, we may use multiple crops of the input image to improve accuracy, but this is usually slow and not feasible for real-time processing.

Feature map strides

Single shot detectors often let us choose which feature map layers to use for object detection. A feature map has a stride of 2 if its resolution is halved in each dimension. Lower resolution feature maps usually capture higher-level structures that are good for object detection, but the loss of spatial resolution makes it harder to detect small objects.
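
To make the effect of stride concrete, the short sketch below computes the feature map size for a few strides and shows how many cells a small object spans at each one. The input size, object size, and strides are hypothetical, and the size formula assumes padding that rounds up.

```python
import math

def feature_map_size(input_size, stride):
    """Spatial size of a feature map, assuming padding that rounds up."""
    return math.ceil(input_size / stride)

input_size = 512   # hypothetical square input, in pixels
object_size = 16   # hypothetical small object, in pixels

sizes = {}
for stride in (8, 16, 32):
    size = feature_map_size(input_size, stride)
    sizes[stride] = size
    cells_covered = object_size / stride
    print(f"stride {stride:2d}: {size}x{size} map, object spans {cells_covered:.1f} cells")
```

At the largest stride the object covers only half a cell, which gives an intuition for why coarse feature maps struggle with small objects.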

Speed vs. accuracy

The most important question is not which detector is the best. The real question is which detector and which configuration give the best balance of speed and accuracy for the application at hand. Below is a comparison of the accuracy vs. speed tradeoff (time measured in milliseconds).

In general, Faster R-CNN is more accurate while R-FCN and SSD are faster. Faster R-CNN using Inception ResNet with 300 proposals gives the highest accuracy at 1 FPS. SSD on MobileNet has the highest mAP among the fastest models. This graph also helps us locate some sweet spots with a good return in the speed and accuracy tradeoff. R-FCN models using Residual Network strike a good balance between accuracy and speed, while Faster R-CNN with ResNet can attain similar performance if we restrict the number of proposals to 50.

Feature extractor accuracy

The paper studies how the accuracy of the feature extractor (top-1 accuracy on classification) impacts the detector accuracy. Both Faster R-CNN and R-FCN can take advantage of a better feature extractor, but the gain is less significant with SSD.

Object size

For large objects, SSD performs pretty well even with a simpler extractor, and with a better extractor it can even match the accuracy of other detectors. But SSD performs much worse on small objects compared to other methods.

Input image resolution

Higher resolution improves detection of small objects significantly while also helping large objects. When resolution is decreased by a factor of two in both dimensions, accuracy drops by 15.88% on average, but inference time is also reduced by 27.4% on average.

Number of proposals

The number of proposals generated can significantly impact the speed of Faster R-CNN (FRCNN) without a major decrease in accuracy. For example, with Inception ResNet, Faster R-CNN can run 3x faster when using 50 proposals instead of 300, and the drop in accuracy is only 4%. Because R-FCN has much less work per ROI, the speed improvement is far less significant.

The Journey & the trend

We started our discussion of object detection with sliding windows over an image.

# Sliding windows
for window in windows:
    patch = get_patch(image, window)
    results = detector(patch)

To improve speed, we either reduce the number of windows or reduce the work needed for each ROI (i.e. move work out of the for-loop). R-CNN uses a region proposal method to reduce the number of windows (ROIs) to about 2000. Fast R-CNN reduces the work for each ROI by using the feature maps instead of the image patches to detect objects. This saves us from running feature extraction 2000 times.

# Fast R-CNN
feature_maps = process(image)
ROIs = region_proposal(feature_maps)
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    results = detector2(patch)

However, region proposal takes time. Faster R-CNN replaces the external region proposal method with a convolutional network and reduces the inference time from 2.3s to 0.3s. Faster R-CNN also introduces anchors so that predictions are more diverse and the model is much easier to train. The journey to cut work per ROI does not end there. R-FCN computes position-sensitive score maps independent of ROIs. Each map scores the chance of finding a certain part of an object of a given class. The class score for an ROI is simply the average of those scores.

# R-FCN
feature_maps = process(image)
ROIs = region_proposal(feature_maps)
score_maps = compute_score_map(feature_maps)
for ROI in ROIs:
    V = pool(score_maps, ROI)
    class_scores = average(V)
    class_probabilities = softmax(class_scores)
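
The last two steps of that loop, averaging the pooled scores and applying softmax, can be sketched in plain Python as follows. This is only an illustration: the `vote` helper, the 3x3 grid of bins, and the score values are hypothetical, and the pooling itself is assumed to have already happened.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def vote(pooled):
    """pooled[c] holds the k*k position-sensitive scores pooled for class c."""
    class_scores = [sum(bins) / len(bins) for bins in pooled]
    return softmax(class_scores)

# Hypothetical pooled scores for one ROI: 3x3 = 9 bins per class.
pooled = [
    [0.1] * 9,                                      # background
    [0.9, 0.8, 0.9, 0.7, 0.9, 0.8, 0.9, 0.9, 0.8],  # object class
]
probs = vote(pooled)
print(probs)
```

Because the per-ROI work is just an average and a softmax, almost all of the computation is shared across ROIs.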

However, even though R-FCN is faster, it can be less accurate than Faster R-CNN. But why do we need a 2-stage computation, one for ROIs and one for object detection? Single shot detectors remove the need for individual computations for each ROI. Instead, they predict both boundary boxes and classes simultaneously in a single shot.

feature_maps = process(image)
results = detector3(feature_maps) # No more separate step for ROIs

Both SSD and YOLO are single shot detectors. Both use convolutional layers to extract features, followed by a convolution filter to make predictions. Both use relatively low-resolution feature maps for object detection. Therefore, their accuracy is usually lower than that of region-based detectors because they perform much worse on small objects. To remedy the problem, single shot detectors add higher resolution feature maps to detect objects. However, high-resolution feature maps contain fewer high-level structures, so object prediction on them is less accurate. FPN mitigates that by deriving the higher resolution feature map from the original feature map and the upsampled lower resolution maps. This adds high-level structure information while retaining more accurate spatial location information. Overall accuracy improves because objects at different scales are detected better.
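
The FPN merge step can be sketched on a toy single-channel example: upsample the coarse map by 2 and add it element-wise to the finer map. This is a simplified illustration using nested lists with made-up values; the actual FPN also applies a 1x1 convolution to the finer map before the addition and a 3x3 convolution after it, both omitted here.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out

def merge(lateral, top_down):
    """Add the upsampled coarse map to the finer (lateral) map."""
    up = upsample2x(top_down)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, up)]

# Hypothetical maps: a 2x2 coarse map and a 4x4 finer map.
coarse = [[1.0, 2.0], [3.0, 4.0]]
fine = [[10.0] * 4 for _ in range(4)]
merged = merge(fine, coarse)
for row in merged:
    print(row)
```

Each cell of the merged map keeps the spatial resolution of the finer map while carrying a contribution from the coarser, more semantic one.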

During training, we deal with many more predictions on the image background than on real objects. So the model is trained well to detect background but not necessarily real objects. Focal loss reduces the importance of examples that are already classified well. By combining a more complex feature extractor, FPN, and focal loss, RetinaNet achieves some of the most accurate results for object detection.
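
A minimal sketch of the focal loss for a single prediction is shown below, next to plain cross entropy for comparison. Here `p_t` is the predicted probability of the true class; the defaults gamma = 2 and alpha = 0.25 follow the focal loss paper, while the use of a single alpha for every example and the sample probabilities are simplifying assumptions.

```python
import math

def cross_entropy(p_t):
    """Standard cross entropy for the true-class probability p_t."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """Cross entropy scaled down by (1 - p_t)^gamma for confident predictions."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

easy = 0.95  # hypothetical well-classified example (e.g. obvious background)
hard = 0.30  # hypothetical poorly classified example

for name, p in (("easy", easy), ("hard", hard)):
    ratio = focal_loss(p) / cross_entropy(p)
    print(f"{name}: CE={cross_entropy(p):.4f} FL={focal_loss(p):.6f} ratio={ratio:.6f}")
```

The easy example keeps a far smaller fraction of its cross-entropy loss than the hard one, so the many easy background predictions no longer dominate training.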

The difference between detectors is narrowing. Single shot detectors use more complex designs to become more accurate, and region-based detectors streamline their operations to become faster. YOLO, for example, has incorporated features used in other types of detectors. Eventually, the significant difference may lie not in the basic concept of the models but in the implementation details.

Lessons learned

  • Feature Pyramid Networks produce semantically rich feature maps with high-resolution spatial information to improve accuracy.
  • Complex feature extractors like ResNet and Inception ResNet are key to high accuracy if speed is not a concern.
  • Single shot detectors with a light but powerful extractor like MobileNet are good for real-time processing, in particular on less powerful mobile devices.
  • Use batch normalization.
  • Experiment with different feature extractors to find a good balance between speed and accuracy. Some lightweight extractors give a significant speed improvement with a tolerable accuracy drop.
  • Use anchors to make boundary box predictions.
  • Select anchors carefully.
  • Crop images in training to learn features in different scales (data augmentation).
  • At the cost of speed, higher resolution input images improve accuracy, in particular for small objects.
  • Fewer proposals for Faster R-CNN can improve speed without too much of an accuracy drop.
  • End-to-end training with multi-task loss improves performance.
  • Experiment with the number of proposals or predictions per grid cell.
  • Experiment with different weights for different losses (localization, classification, etc.).
  • Experiment with atrous mode. It provides a wider field of view at the same computational cost and can help accuracy.

For single shot detectors:

  • They are faster, but further verification is needed on whether they can beat the accuracy of Faster R-CNN or R-FCN.
  • Use convolution filters to make boundary box and classification predictions simultaneously.
  • Use multiple feature map layers for object detection.
  • Tend to have problems with objects that are too close together or too small.
  • Feature extraction becomes the speed bottleneck. Look for simpler networks that have no significant accuracy drop.
