The section on Regional Proposal Network should give you a high-level description.
This is a CNN deep neural network that produces the 256 component vector. The concept is not much different than a CNN classifier with the exception of 256 output values. If this still sounds strange to you, you may want to take a look into CNN first. Not sure, that is the problems you have or not since I don’t know what is your background. But hope this may give you some pointers.