When Redmon details the implementation later in the paper, he writes:
As discussed above, after our initial training on images at 224 × 224 we fine tune our network at a larger size, 448. For this fine tuning we train with the above parameters but for only 10 epochs and starting at a learning rate of 10⁻³. At this higher resolution our network achieves a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.
So this differs slightly from his claim earlier in the paper. It is an implementation detail, though, and probably not significant for understanding the idea. I really appreciate how carefully Redmon documents his improvements; many other papers leave out the implementation details needed to replicate their results.
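To make the quoted fine-tuning recipe concrete, here is a minimal sketch of how I would write it down as a config. This is not Redmon's code: `TrainConfig` and `fine_tune_config` are names I made up for illustration. Only the 224 and 448 resolutions, the 10 epochs, and the 10⁻³ starting learning rate come from the quote; the base `epochs` and `learning_rate` are placeholders, because the "above parameters" are not part of the quoted passage.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the settings the quote mentions."""
    image_size: int       # square input resolution in pixels
    epochs: int
    learning_rate: float  # starting learning rate


def fine_tune_config(base: TrainConfig) -> TrainConfig:
    """Keep the base settings, overriding only the three values the quote changes."""
    return replace(base, image_size=448, epochs=10, learning_rate=1e-3)


# Placeholder base stage: only image_size=224 is taken from the quote.
# The epochs and learning_rate values here are stand-ins for illustration.
base = TrainConfig(image_size=224, epochs=100, learning_rate=0.1)
fine = fine_tune_config(base)
print(fine)
```

The point of deriving the second stage from the first is that everything not named in the quote stays the same, which is how I read "we train with the above parameters."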