For our discussion here, let’s say k=3.
“position-sensitive score maps” are not filters. We start with the feature map and use 9 different groups of convolution filters to create 9 position-sensitive score maps for each class. These are kind of feature maps but each one responsible for different part of the object. Say we are working on the top left corner, then a position-sensitive score maps will store the feature suppose to be the top left corner of the object that wants to be detected. The purpose here is to detect a face which the top left corner should be an eye.