Not exactly. I would avoid the term “sliding window” because it refers to something different in older detection methods. Each cell has several default boxes, say 6, with different aspect ratios and scales, and the network makes one prediction relative to each default box. You may ask why the prediction is not made directly relative to the center of the cell instead of a default box. The short answer is that predicting relative to default boxes makes training more stable, at least empirically.
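
To make "one prediction relative to each default box" concrete, here is a minimal sketch in NumPy. The function names (`default_boxes_for_cell`, `decode`), the scale values, and the aspect ratios are all hypothetical and chosen only for illustration. The offset encoding is the common center-size convention (shift the center by a fraction of the default box's size, scale the width and height by an exponential), and it leaves out the extra scaling factors that some implementations apply, so treat it as an assumption rather than the exact scheme of any particular detector.

```python
import numpy as np


def default_boxes_for_cell(cx, cy, scale, aspect_ratios, extra_scale):
    """Build the default boxes (cx, cy, w, h) that share one cell center.

    All names and values here are illustrative assumptions.
    """
    boxes = []
    for ar in aspect_ratios:
        # Same area (scale ** 2) for every aspect ratio, different shape.
        w = scale * np.sqrt(ar)
        h = scale / np.sqrt(ar)
        boxes.append((cx, cy, w, h))
    # One additional square box at a larger scale.
    boxes.append((cx, cy, extra_scale, extra_scale))
    return np.array(boxes)


def decode(offsets, defaults):
    """Turn per-box offsets (tx, ty, tw, th) into boxes (cx, cy, w, h).

    Each offset is interpreted relative to its own default box,
    not relative to the bare cell center.
    """
    cx = defaults[:, 0] + offsets[:, 0] * defaults[:, 2]
    cy = defaults[:, 1] + offsets[:, 1] * defaults[:, 3]
    w = defaults[:, 2] * np.exp(offsets[:, 2])
    h = defaults[:, 3] * np.exp(offsets[:, 3])
    return np.stack([cx, cy, w, h], axis=1)


# Six default boxes for a cell centered at (0.5, 0.5) in normalized coordinates.
defaults = default_boxes_for_cell(
    cx=0.5, cy=0.5, scale=0.2,
    aspect_ratios=[1.0, 2.0, 3.0, 0.5, 1.0 / 3.0],
    extra_scale=0.27,
)

# A network would output one offset vector per default box; zeros are used
# here as a stand-in, which simply reproduces the default boxes.
offsets = np.zeros((6, 4))
predicted = decode(offsets, defaults)
```

With all-zero offsets the decoded boxes are exactly the default boxes, so the network only has to learn small corrections to a box that already has roughly the right shape. That is one plausible reading of why this setup trains more stably than regressing every box from the cell center.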