Object detection: speed and accuracy comparison (Faster R-CNN, R-FCN, SSD, FPN, RetinaNet and YOLOv3)

  • Feature extractors (VGG16, ResNet, Inception, MobileNet).
  • Output strides for the extractor.
  • Input image resolutions.
  • Matching strategy and IoU threshold (how predictions are excluded in calculating loss).
  • Non-max suppression IoU threshold.
  • Hard example mining ratio (positive vs. negative anchor ratio).
  • The number of proposals or predictions.
  • Boundary box encoding.
  • Data augmentation.
  • Training dataset.
  • Use of multi-scale images in training or testing (with cropping).
  • Which feature map layer(s) for object detection.
  • Localization loss function.
  • Deep learning software platform used.
  • Training configurations including batch size, input image resize, learning rate, and learning rate decay.
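
Two of the factors above, the matching IoU threshold and the non-max suppression (NMS) IoU threshold, rest on the same overlap measure. As a minimal sketch (not taken from any of the compared papers), the pure-Python code below computes IoU for boxes assumed to be in (x1, y1, x2, y2) form and applies a simple greedy NMS. The function names `iou` and `nms`, the box format, and the sample boxes are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already-kept box is at or below the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)
```

With these sample boxes, the first two overlap with an IoU of 81/119 (about 0.68), so the lower-scoring one is suppressed at a threshold of 0.5 but survives at 0.7. This is one way to see why results reported with different NMS thresholds are hard to compare directly.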

Performance results

VOC 2012 for Faster R-CNN.
COCO for Faster R-CNN
VOC 2012 for R-FCN
Speed is measured with a batch size of 1 or 8 during inference.
VOC 2007 for YOLOv2
VOC 2012 for YOLOv2
COCO for YOLOv2
COCO for RetinaNet
COCO for RetinaNet

Comparing paper results

Result on COCO

Takeaway so far

  • Region-based detectors like Faster R-CNN demonstrate a small accuracy advantage if real-time speed is not needed.
  • Single-shot detectors are built for real-time processing, but applications need to verify whether they meet their accuracy requirements.

Comparison SSD MobileNet, YOLOv2, YOLO9000 and Faster R-CNN

Report by Google Research (Source)

  • Faster R-CNN using Inception ResNet with 300 proposals gives the highest accuracy of all the tested cases, at 1 FPS.
  • SSD on MobileNet has the highest mAP among the models targeted for real-time processing.
  • R-FCN models using a Residual Network strike a good balance between accuracy and speed.
  • Faster R-CNN with ResNet can attain similar performance if we restrict the number of proposals to 50.

Lessons learned

  • R-FCN and SSD models are faster on average but cannot beat Faster R-CNN in accuracy if speed is not a concern.
  • Faster R-CNN requires at least 100 ms per image.
  • Using only low-resolution feature maps for detection hurts accuracy badly.
  • Input image resolution impacts accuracy significantly. Reducing the image size by half in both width and height lowers accuracy by 15.88% on average but also reduces inference time by 27.4% on average.
  • The choice of feature extractor impacts detection accuracy for Faster R-CNN and R-FCN, but SSD is less reliant on it.
  • Post-processing, including non-max suppression (which runs only on the CPU), takes up the bulk of the running time for the fastest models, at about 40 ms, which caps speed at 25 FPS.
  • If mAP is calculated with a single IoU threshold only, use mAP@IoU=0.75.
  • With an Inception ResNet network as the feature extractor, using stride 8 instead of 16 improves mAP by 5% but increases running time by 63%.
  • The most accurate single model uses Faster R-CNN with Inception ResNet and 300 proposals. It takes 1 second per image.
  • The most accurate model overall is an ensemble model with multi-crop inference. It achieves state-of-the-art detection accuracy on the 2016 COCO challenge. It uses the vector of average precision to select the five most different models.
  • SSD with MobileNet provides the best accuracy tradeoff among the fastest detectors.
  • SSD is fast but performs worse on small objects compared with the others.
  • For large objects, SSD can outperform Faster R-CNN and R-FCN in accuracy with lighter and faster extractors.
  • Faster R-CNN can match the speed of R-FCN and SSD at 32 mAP if we reduce the number of proposals to 50.
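
Several of the lessons above are statements about per-image latency, and the arithmetic behind them can be sketched in a few lines. The 40 ms, 100 ms, 63%, and 27.4% figures come from the list above. The helper names and the 100 ms baseline used for the relative changes are hypothetical, chosen only to make the numbers easy to follow.

```python
def fps_from_latency_ms(latency_ms):
    """Upper bound on frames per second for a given per-image latency."""
    return 1000.0 / latency_ms


def latency_after_change(latency_ms, relative_change):
    """Apply a relative change (e.g. +0.63 or -0.274) to a latency."""
    return latency_ms * (1.0 + relative_change)


# About 40 ms of post-processing caps the fastest models at 25 FPS.
post_processing_cap = fps_from_latency_ms(40)

# A detector that needs at least 100 ms per image cannot exceed 10 FPS.
faster_rcnn_cap = fps_from_latency_ms(100)

# Hypothetical 100 ms baseline, used only for illustration.
baseline_ms = 100.0
stride8_ms = latency_after_change(baseline_ms, 0.63)     # about 163 ms
half_res_ms = latency_after_change(baseline_ms, -0.274)  # about 72.6 ms
```

Read this way, halving the input resolution on a 100 ms model would save roughly 27 ms per image, while switching to stride 8 would add roughly 63 ms. The accuracy side of each tradeoff still has to be checked against the application's requirements.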


