Understanding Feature Pyramid Networks for object detection (FPN)

Detecting objects at different scales is challenging, particularly for small objects. We can use a pyramid of the same image at different scales to detect objects (the left diagram below). However, processing images at multiple scales is time-consuming, and the memory demand is too high to train end-to-end on all scales simultaneously. Hence, we may use it only at inference time to push accuracy as high as possible, particularly in competitions where speed is not a concern. Alternatively, we can create a pyramid of features and use them for object detection (the right diagram). However, feature maps closer to the image layer are composed of low-level structures that are not effective for accurate object detection.

Figure: image pyramid (left) and feature pyramid (right) (Source)

Data Flow

Figure: FPN (Source)
Figure: Feature extraction in FPN (Modified from source)
Figure (Modified from source)
Figure: Reconstruct spatial resolution in the top-down pathway (Modified from source)
Figure: Add skip connections (Source)

Bottom-up pathway

The bottom-up pathway uses ResNet for feature extraction. It is composed of several convolution modules (conv1 through conv5), each containing many convolution layers. As we move up, the spatial dimension is halved at each module (i.e., the stride doubles). The output of each convolution module is labeled Ci (C1 through C5) and is later used in the top-down pathway.
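To make the halving concrete, here is a minimal sketch that lists the stride and spatial size of each Ci. The helper name `bottom_up_shapes` and the 224 × 224 input are assumptions chosen for illustration, and the sketch assumes the input size is divisible by 32 so that plain integer division matches what the network produces.

```python
def bottom_up_shapes(height, width, num_stages=5):
    """Return {"Ci": (stride, H_i, W_i)} assuming each stage halves the resolution.

    Hypothetical helper: assumes C1 sits at stride 2 and every later
    stage doubles the stride, with input sizes divisible by 2**num_stages.
    """
    shapes = {}
    for i in range(1, num_stages + 1):
        stride = 2 ** i  # the stride doubles at every stage
        shapes[f"C{i}"] = (stride, height // stride, width // stride)
    return shapes


for name, (stride, h, w) in bottom_up_shapes(224, 224).items():
    print(f"{name}: stride {stride:>2}, feature map {h} x {w}")
```

Under these assumptions, C5 is 32 times smaller than the input along each side, which is why its spatial detail is coarse even though its features are semantically strong.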


Top-down pathway

We apply a 1 × 1 convolution filter to reduce C5 channel depth to 256-d to create M5. This becomes the first feature map layer used for object prediction.
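A 1 × 1 convolution is a per-pixel linear map across channels: it changes the channel depth without touching the spatial layout. The sketch below shows this in plain Python on a toy tensor. The function name `conv1x1`, the tiny sizes (4 channels reduced to 2, standing in for the reduction to 256-d), the hand-picked weights, and the omission of a bias term are all assumptions for illustration.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution without bias.

    feature_map: nested list shaped [C_in][H][W]
    weights:     nested list shaped [C_out][C_in]
    returns:     nested list shaped [C_out][H][W]
    """
    c_in = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [
            [
                # Each output value mixes the C_in values at one pixel.
                sum(w_row[c] * feature_map[c][y][x] for c in range(c_in))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for w_row in weights
    ]


# Toy stand-in for C5: 4 channels on a 2 x 2 grid.
c5 = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
# Hand-picked weights: 4 input channels -> 2 output channels.
weights = [
    [1.0, 0.0, 0.0, 0.0],  # output channel 0 copies input channel 0
    [0.0, 1.0, 0.5, 0.0],  # output channel 1 mixes input channels 1 and 2
]

m5 = conv1x1(c5, weights)
print(len(m5), "channels,", len(m5[0]), "x", len(m5[0][0]), "spatial")
```

The spatial size of the output matches the input; only the channel count changes, which is what lets feature maps from different stages be combined later.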


FPN with RPN (Region Proposal Network)

Figure (Source)

FPN with Fast R-CNN or Faster R-CNN

Let’s take a quick look at the Fast R-CNN and Faster R-CNN data flow. They work with a single feature map layer to create ROIs. We use the ROIs and the feature map layer to create feature patches that are fed into ROI pooling.
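The ROI pooling step can be sketched as follows: crop the ROI from the feature map, split it into a fixed grid of bins, and take the maximum in each bin so that every ROI yields a patch of the same size. This is a simplified illustration, not the detectors' actual implementation: the function name `roi_max_pool`, the single-channel feature map, the ROI given directly in feature map cells, and the integer bin-splitting rule are all assumptions.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one ROI into an out_size x out_size patch.

    feature_map: nested list shaped [H][W] (single channel for simplicity)
    roi:         (x0, y0, x1, y1) in feature map cells, x1 and y1 exclusive;
                 must be non-empty and lie inside the feature map
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0

    def bin_edges(start, length, index):
        # Integer split of the ROI side; every bin covers at least one cell.
        lo = (index * length) // out_size
        hi = max(((index + 1) * length) // out_size, lo + 1)
        return start + lo, start + hi

    pooled = []
    for i in range(out_size):
        ys, ye = bin_edges(y0, roi_h, i)
        row = []
        for j in range(out_size):
            xs, xe = bin_edges(x0, roi_w, j)
            row.append(
                max(feature_map[y][x] for y in range(ys, ye) for x in range(xs, xe))
            )
        pooled.append(row)
    return pooled


# Toy 4 x 4 feature map with values 0..15.
fm = [[r * 4 + c for c in range(4)] for r in range(4)]

print(roi_max_pool(fm, (0, 0, 4, 4)))  # whole map pooled to 2 x 2
print(roi_max_pool(fm, (0, 0, 3, 3)))  # a 3 x 3 ROI pooled to 2 x 2
```

Both ROIs produce a 2 × 2 patch even though they differ in size, which is the property the downstream layers rely on.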


Segmentation

Just like Mask R-CNN, FPN is also good at extracting masks for image segmentation. Using an MLP, a 5 × 5 window is slid over the feature maps to generate object segments of dimension 14 × 14. Later, we merge masks at different scales to form the final mask predictions.

Figure (Source)

Results

Placing FPN in RPN improves AR (average recall: the ability to capture objects) to 56.3, an 8.0-point improvement over the RPN baseline. Performance on small objects increases by a large margin of 12.9 points.

Lessons learned

Here are some lessons learned from the experimental data.

  • The top-down pathway restores resolution with rich semantic information.
  • But we need lateral connections to add more precise object spatial information back.
  • The top-down pathway plus lateral connections improve AR by 8 points on the COCO dataset. For small objects, the improvement is 12.9 points.
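The first two points can be illustrated with a minimal sketch of one merge step: a coarse top-down map is upsampled to restore resolution, then a lateral map from the bottom-up pathway is added to bring spatial detail back. The text above does not spell out the operators, so the 2× nearest-neighbor upsampling and element-wise addition used here are assumptions (they follow the common FPN formulation), as are the function names, the single-channel maps, and the premise that the lateral map already has a matching channel depth.

```python
def upsample2x_nearest(feature_map):
    """Double the height and width of an [H][W] map by repeating each cell."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(2)]  # repeat columns
        out.append(wide)
        out.append(list(wide))  # repeat the row
    return out


def merge_top_down(top_down, lateral):
    """Upsample the coarse top-down map and add the finer lateral map.

    Assumes `lateral` is exactly twice the size of `top_down` per side.
    """
    upsampled = upsample2x_nearest(top_down)
    return [
        [u + l for u, l in zip(up_row, lat_row)]
        for up_row, lat_row in zip(upsampled, lateral)
    ]


coarse = [[1, 2], [3, 4]]              # stand-in for a 2 x 2 top-down map
fine = [[10] * 4 for _ in range(4)]    # stand-in for a 4 x 4 lateral map

for row in merge_top_down(coarse, fine):
    print(row)
```

The merged map has the finer map's resolution while carrying the coarse map's values into every cell it covers, which is the intuition behind combining the two pathways.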
