Each default box makes exactly one prediction. For example, if a layer has k x k spatial locations and each location has 6 default boxes, that layer makes at most k x k x 6 predictions.
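To make the count concrete, here is a minimal sketch of the arithmetic. The function name `num_predictions` and the feature-map sizes in the loop are hypothetical, chosen only for illustration; only the 6 boxes per location comes from the text above.

```python
def num_predictions(k: int, boxes_per_location: int = 6) -> int:
    """Upper bound on the number of predictions for a k x k feature map."""
    return k * k * boxes_per_location


# Hypothetical layer sizes, used only to illustrate the formula.
for k in (19, 10, 5):
    print(f"{k} x {k} layer: at most {num_predictions(k)} predictions")
```

Smaller feature maps contribute far fewer predictions, so most default boxes come from the higher-resolution layers.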

For the localization loss, we count positive matches only.

For the confidence loss (classification error), however, we count both positive and negative matches.
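The difference between the two losses can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: the smooth L1 box loss, the cross-entropy confidence loss, the normalization by the number of positives, and all names (`detection_losses`, `kept_negative`, and so on) are choices made for this example. Label 0 stands for "no object".

```python
import math


def smooth_l1(x: float) -> float:
    """Smooth L1 penalty (assumed choice of localization loss)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def cross_entropy(probs, label) -> float:
    """Negative log-probability of the true class."""
    return -math.log(probs[label])


def detection_losses(pred_offsets, target_offsets, class_probs, labels, kept_negative):
    """labels[i] > 0: default box i is a positive match; labels[i] == 0: negative.
    kept_negative[i]: whether negative box i is included in training."""
    loc_loss, conf_loss, num_pos = 0.0, 0.0, 0
    for i, label in enumerate(labels):
        if label > 0:
            num_pos += 1
            # Localization loss: positive matches only.
            loc_loss += sum(
                smooth_l1(p - t) for p, t in zip(pred_offsets[i], target_offsets[i])
            )
            conf_loss += cross_entropy(class_probs[i], label)
        elif kept_negative[i]:
            # Confidence loss: negatives count too, with class 0 as the target.
            conf_loss += cross_entropy(class_probs[i], 0)
    n = max(num_pos, 1)  # assumed normalization; avoids division by zero
    return loc_loss / n, conf_loss / n
```

Note that the box offsets predicted for a negative never enter the result, however wrong they are, because a negative has no ground-truth box to regress toward.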

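Negative matches usually far outnumber positive ones, so to maintain class balance we keep a fixed ratio of negative to positive examples (3:1 is a common choice). Class 0 is the "no object" class. For predictions that should not contain an object, we sort them by their confidence score for class 0 and train on those with the lowest scores: these are the negatives the model is least sure are background, that is, the hardest ones.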
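The selection step above can be sketched in a few lines. This is a minimal sketch under assumptions: the function name `select_hard_negatives`, its parameters, and the default ratio of 3 are hypothetical choices for this example, not taken from a particular library.

```python
def select_hard_negatives(background_scores, is_positive, neg_pos_ratio=3):
    """Return the indices of the negative default boxes to train on.

    background_scores[i]: predicted confidence for class 0 on default box i.
    is_positive[i]: True if default box i is matched to a ground-truth object.
    neg_pos_ratio: assumed cap on negatives kept per positive.
    """
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Lowest class-0 score first: the negatives the model gets most wrong.
    negatives.sort(key=lambda i: background_scores[i])
    return negatives[: neg_pos_ratio * num_pos]


# Toy data: one positive (index 1) and five negatives.
scores = [0.9, 0.1, 0.4, 0.95, 0.2, 0.7]
positive = [False, True, False, False, False, False]
print(select_hard_negatives(scores, positive))  # [4, 2, 5]
```

With one positive and a ratio of 3, only the three negatives with the lowest class-0 scores are kept; the two that the model already classifies confidently as background are dropped from the loss.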

