That is about FPN (Feature Pyramid Network). FPN has 2 pathways. The bottom-up pathway is just the regular CNN, which reduces the spatial dimension as it extracts features. The top-down pathway runs in the reverse direction, upsampling the feature maps (similar to deconvolution). In YOLOv3, the reverse direction goes back 2 layers (instead of 1) to generate the feature maps needed for object detection. If you are interested in why single shot detectors have problems dealing with small objects, the FPN paper explains the issue and proposes the solution that YOLOv3 is based on.
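To make the two pathways concrete, here is a minimal pure-Python sketch on a toy 8x8 single-channel map. It is an illustration under simplifying assumptions, not the FPN or YOLOv3 implementation: the helper names (`downsample2x`, `upsample2x`, `merge_add`) are made up for this example, 2x2 max pooling stands in for a strided convolution stage, and the 1x1 and 3x3 convolutions that FPN applies around the merge are left out.

```python
def downsample2x(fmap):
    """Bottom-up step: halve height and width with 2x2 max pooling
    (a stand-in for a strided convolution stage)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]


def upsample2x(fmap):
    """Top-down step: double height and width by nearest-neighbor repetition."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out


def merge_add(top_down, lateral):
    """FPN-style merge: element-wise sum of the upsampled map and the
    same-size map from the bottom-up pathway."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(top_down, lateral)]


# Bottom-up pathway: spatial size shrinks 8x8 -> 4x4 -> 2x2.
c1 = [[r * 8 + c for c in range(8)] for r in range(8)]
c2 = downsample2x(c1)
c3 = downsample2x(c2)

# Top-down pathway: start from the coarsest map and work back up,
# merging with the bottom-up map of matching size at each level.
p3 = c3
p2 = merge_add(upsample2x(p3), c2)
p1 = merge_add(upsample2x(p2), c1)

for name, fmap in [("p3", p3), ("p2", p2), ("p1", p1)]:
    print(name, len(fmap), "x", len(fmap[0]))
```

Each of `p3`, `p2`, and `p1` could feed a detection head at its own scale, with the finer maps being the ones that help with small objects. Note that the merge here is the element-wise addition used in FPN; YOLOv3 instead concatenates the upsampled map with the earlier feature map along the channel dimension.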
