Object detection: speed and accuracy comparison (Faster R-CNN, R-FCN, SSD, FPN, RetinaNet and YOLOv3)

Comparing the papers fairly is hard because they differ along many dimensions, including:

  • Feature extractors (VGG16, ResNet, Inception, MobileNet).
  • Output strides for the extractor.
  • Input image resolutions.
  • Matching strategy and IoU threshold (which predictions count as positives or negatives when computing the loss; see the sketch after this list).
  • Non-max suppression IoU threshold.
  • Hard example mining ratio (the ratio of positive to negative anchors).
  • The number of proposals or predictions.
  • Bounding box encoding.
  • Data augmentation.
  • Training dataset.
  • Use of multi-scale images in training or testing (with cropping).
  • Which feature map layer(s) are used for detection.
  • Localization loss function.
  • Deep learning software platform used.
  • Training configurations, including batch size, input image resizing, learning rate, and learning rate decay.
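
To make the matching and encoding choices above concrete, here is a minimal NumPy sketch of IoU-based anchor matching and Faster R-CNN style box encoding. This is an illustration written for this post, not code from any of the papers; the 0.7/0.3 thresholds follow the Faster R-CNN convention, while SSD typically matches at IoU ≥ 0.5.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label each anchor 1 (positive), 0 (negative), or -1 (ignored)
    by its best IoU against the ground-truth boxes. Anchors in the
    ignored band contribute nothing to the loss."""
    labels = np.full(len(anchors), -1, dtype=np.int64)
    for i, anchor in enumerate(anchors):
        best = max(iou(anchor, gt) for gt in gt_boxes)
        if best >= pos_thresh:
            labels[i] = 1
        elif best < neg_thresh:
            labels[i] = 0
    return labels

def encode_box(gt, anchor):
    """Faster R-CNN style box encoding: center offsets scaled by the
    anchor size, plus log-space width/height ratios."""
    ax, ay = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    return ((gx - ax) / aw, (gy - ay) / ah,
            np.log(gw / aw), np.log(gh / ah))
```

Moving pos_thresh and neg_thresh changes which anchors contribute to the loss at all, which is one reason head-to-head numbers across papers are hard to compare.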

Performance results

The papers report results for the following detector and dataset combinations:

  • VOC 2012 for Faster R-CNN.
  • COCO for Faster R-CNN.
  • VOC 2012 for R-FCN.
  • COCO for R-FCN.
  • SSD (speed is measured with a batch size of 1 or 8 during inference).
  • COCO for SSD.
  • VOC 2007 for YOLOv2.
  • VOC 2012 for YOLOv2.
  • COCO for YOLOv2.
  • COCO for YOLOv3.
  • COCO for FPN.
  • COCO for RetinaNet.

Comparing paper results

Results on COCO

Takeaway so far

  • Region-based detectors like Faster R-CNN demonstrate a small accuracy advantage if real-time speed is not needed.
  • Single-shot detectors are designed for real-time processing, but applications need to verify that they meet their accuracy requirements.

Comparison of SSD MobileNet, YOLOv2, YOLO9000 and Faster R-CNN

Report by Google Research

  • Faster R-CNN using Inception ResNet with 300 proposals gives the highest accuracy, at 1 FPS, for all the tested cases.
  • SSD on MobileNet has the highest mAP among the models targeted for real-time processing.
  • R-FCN models using ResNet strike a good balance between accuracy and speed, while Faster R-CNN with ResNet can attain similar performance if we restrict the number of proposals to 50.

Lessons learned

  • R-FCN and SSD models are faster on average, but they cannot beat Faster R-CNN in accuracy if speed is not a concern.
  • Faster R-CNN requires at least 100 ms per image.
  • Using only low-resolution feature maps for detection hurts accuracy badly.
  • Input image resolution impacts accuracy significantly. Reducing the image width and height by half lowers accuracy by 15.88% on average but also reduces inference time by 27.4% on average.
  • The choice of feature extractor impacts detection accuracy strongly for Faster R-CNN and R-FCN, but SSD is less sensitive to it.
  • Post-processing, including non-max suppression (which runs only on the CPU), takes up the bulk of the running time for the fastest models: about 40 ms, which caps speed at roughly 25 FPS (see the sketch after this list).
  • If mAP is calculated with a single IoU threshold only, use mAP@IoU=0.75.
  • With Inception ResNet as the feature extractor, using stride 8 instead of 16 improves mAP by a relative 5% but increases running time by a relative 63%.
  • The most accurate single model uses Faster R-CNN with Inception ResNet and 300 proposals; it runs at 1 second per image.
  • The most accurate model overall is an ensemble with multi-crop inference. It achieved state-of-the-art accuracy in the 2016 COCO challenge, using the vector of per-class average precisions to select the five most diverse models.
  • SSD with MobileNet provides the best accuracy trade-off among the fastest detectors.
  • SSD is fast but performs worse on small objects compared with the others.
  • For large objects, SSD can outperform Faster R-CNN and R-FCN in accuracy with lighter and faster extractors.
  • Faster R-CNN can match the speed of R-FCN and SSD at 32 mAP if we reduce the number of proposals to 50.
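
Since NMS accounts for most of that post-processing time, here is a minimal greedy NMS sketch in NumPy. This illustrates the standard algorithm as written for this post, not the implementation used in any of the benchmarks.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression.

    boxes:  (N, 4) array in (x1, y1, x2, y2) corner format.
    scores: (N,) confidence scores.
    Keeps the highest-scoring box, drops every remaining box whose
    IoU with it exceeds iou_thresh, and repeats.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices, highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box against all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```

The outer loop is inherently sequential (each iteration depends on which boxes survived the previous one), which helps explain why NMS stays on the CPU and becomes the bottleneck for the fastest detectors.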
