We train the model on the assumption that each grid cell contains one object: the training data is labeled that way and the loss function is computed the same way. But at test time, is it legitimate to have one boundary box predicting a cat and another predicting a dog in the same grid cell (possible in YOLOv2, not in v1) when the ground truth also has a dog and a cat in that cell? If that is the behavior you want, the NMS described will work, because both boxes can survive. Otherwise, you can also do what you have in mind. If you look at YOLOv3, it does not penalize predictions that overlap a ground-truth object well but are not the best match. From that perspective, from YOLOv1 to v3, the one-object-per-cell rule is more a part of the training mechanism than a limit on what can be predicted at test time.
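To make the cat-and-dog case concrete, here is a minimal sketch, assuming NMS is applied per class so that a box is only suppressed by a higher-scoring box of the same class. The function names `iou` and `classwise_nms`, the `(box, score, label)` tuple layout, the sample boxes, and the 0.5 threshold are all illustrative assumptions, not taken from any YOLO implementation.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def classwise_nms(detections, iou_threshold=0.5):
    """Hypothetical helper: detections is a list of (box, score, label)."""
    kept = []
    # Visit detections from highest to lowest score.
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        box, _, label = det
        # Suppress only against already-kept boxes of the SAME class.
        if all(k[2] != label or iou(box, k[0]) < iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    ((10, 10, 60, 60), 0.9, "dog"),
    ((12, 12, 62, 62), 0.8, "cat"),  # overlaps the dog box, different class
    ((11, 11, 61, 61), 0.6, "dog"),  # near-duplicate dog with a lower score
]
kept = classwise_nms(detections)
print([label for _, _, label in kept])
```

With these made-up boxes, the duplicate dog is suppressed while the cat and the dog both survive, even though they overlap heavily. If you would rather keep only one box per location, comparing against all kept boxes regardless of label gives a class-agnostic variant.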
BTW, thanks for pointing that out.