That part is about FPN (Feature Pyramid Network). FPN has two paths. The bottom-up path is just a regular CNN, which reduces the spatial dimensions as it extracts features. The top-down path runs in the reverse direction, upsampling the feature maps (similar to deconvolution). In YOLOv3, the reverse direction goes back 2 layers (instead of 1) to generate the feature maps needed for object detection. If you are interested in why single-shot detectors have problems with small objects, the FPN article explains the issue and proposes the solution that YOLOv3 is based on.
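To make the two paths concrete, here is a rough NumPy sketch, not the actual FPN or YOLOv3 code. The function names are made up for illustration, 2x2 average pooling stands in for the strided convolution stages, and the 1x1 lateral and 3x3 smoothing convolutions of a real FPN are left out. It merges by element-wise addition as FPN does; YOLOv3 concatenates the upsampled map with the earlier one instead.

```python
import numpy as np

def downsample2x(x):
    """Stand-in for one strided CNN stage: 2x2 average pooling on a (C, H, W) map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbour 2x upsampling on a (C, H, W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def bottom_up(x, levels=3):
    """Bottom-up path: each level halves the spatial size. Returns fine -> coarse."""
    feats = [x]
    for _ in range(levels - 1):
        feats.append(downsample2x(feats[-1]))
    return feats

def top_down(feats):
    """Top-down path: start at the coarsest map, upsample, and add the
    same-sized bottom-up map (the lateral connection). Returns fine -> coarse."""
    merged = [feats[-1]]
    for lateral in reversed(feats[:-1]):
        merged.append(lateral + upsample2x(merged[-1]))
    return merged[::-1]

# Hypothetical input: 8 channels, 32x32 spatial size.
image_features = np.random.rand(8, 32, 32)
pyramid = top_down(bottom_up(image_features))
shapes = [p.shape for p in pyramid]
print(shapes)
```

Each map in `pyramid` keeps the resolution of its bottom-up level but now also carries information from the coarser levels, which is what helps with small objects on the high-resolution maps.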