The default boxes are fixed before training even starts. During training, the network extracts features and learns to predict each boundary box as an offset relative to a default box. Once trained, the network should be able to produce this mapping from the features alone, so at inference time it extracts features and predicts the boundary boxes. As for the default boxes themselves, you choose them, and you can give them different aspect ratios to match the objects you expect to see. There are other methods, but that is the essence of it.
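
To make the "relative to the default box" part concrete, here is a rough sketch in plain Python. It is not taken from any particular implementation: the function names (`default_box_shapes`, `encode`, `decode`) are made up for illustration, boxes are assumed to be `(cx, cy, w, h)` in normalized image coordinates, and the offset formula is the center/log-size parameterization commonly used by SSD-style detectors, which is an assumption here since the exact form depends on the model.

```python
import math

def default_box_shapes(scale, aspect_ratios):
    """Hypothetical helper: (width, height) per aspect ratio, keeping area = scale**2."""
    return [(scale * math.sqrt(ar), scale / math.sqrt(ar)) for ar in aspect_ratios]

def encode(gt, default):
    """Training target: ground-truth box expressed relative to a default box."""
    gx, gy, gw, gh = gt
    dx, dy, dw, dh = default
    return ((gx - dx) / dw, (gy - dy) / dh, math.log(gw / dw), math.log(gh / dh))

def decode(offsets, default):
    """Inference: turn predicted offsets back into an absolute boundary box."""
    tx, ty, tw, th = offsets
    dx, dy, dw, dh = default
    return (dx + tx * dw, dy + ty * dh, dw * math.exp(tw), dh * math.exp(th))

# Default boxes are chosen up front, e.g. square, wide, and tall shapes.
shapes = default_box_shapes(0.2, [1.0, 2.0, 0.5])

# One wide default box centered in the image (assumed values).
default = (0.5, 0.5, shapes[1][0], shapes[1][1])

# A made-up ground-truth box for a wide object near that location.
gt = (0.55, 0.48, 0.30, 0.12)

offsets = encode(gt, default)         # what the network is trained to output
recovered = decode(offsets, default)  # what you compute from its output at inference
```

In this sketch, `decode(encode(gt, default), default)` gives back the original box up to floating-point error, which is the mapping the network is supposed to learn from the features. A real detector would also match each ground-truth box to the default boxes that overlap it best and often rescales the offsets, both of which are left out here.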