- That is changed in V2. Every guess can output a different class.
- The layers used for object detection has low spatial resolution and too hard to locate small objects. Then, you may ask why not using higher resolution layer. This will give you some answers:
medium.com/@jonathan_hui/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c