It all depends on whether the feature extraction layers can extract enough information for region proposal and classification. Many object detection algorithms assume they can, so they add only one or a few layers after the feature extractor. Will mAP improve if you add more layers? That is a question of speed vs. accuracy. Faster R-CNN already reaches very high accuracy with only a few extra layers, so adding more may help accuracy somewhat, but inference speed will drop.
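To get a rough feel for the speed side of that trade-off, here is a small back-of-the-envelope sketch that counts multiply-accumulates (MACs) in a Faster R-CNN style proposal head: one shared 3x3 convolution followed by two 1x1 convolutions for objectness scores and box offsets. The feature map size (38 x 63), the channel width (512), the 9 anchors per location, and the helper names `conv_macs` and `head_macs` are all assumptions chosen for illustration. MAC count is only a proxy for run time, and it says nothing about how much mAP would change.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for one k x k convolution (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k


# Assumed sizes, for illustration only: feature map height and width,
# channel width, and number of anchors per location.
H, W, C, A = 38, 63, 512, 9


def head_macs(extra_layers=0):
    """MACs for a proposal head with `extra_layers` additional 3x3 convolutions."""
    macs = conv_macs(H, W, C, C, 3)                   # shared 3x3 conv
    macs += extra_layers * conv_macs(H, W, C, C, 3)   # hypothetical extra 3x3 convs
    macs += conv_macs(H, W, C, 2 * A, 1)              # objectness scores (2 per anchor)
    macs += conv_macs(H, W, C, 4 * A, 1)              # box offsets (4 per anchor)
    return macs


base = head_macs(0)
deeper = head_macs(2)
ratio = deeper / base

print(f"base head:   {base / 1e9:.2f} GMACs")
print(f"+2 layers:   {deeper / 1e9:.2f} GMACs")
print(f"cost ratio:  {ratio:.2f}x")
```

Under these assumptions, two extra 3x3 layers make the head close to three times as expensive, because the 3x3 convolutions dominate the cost. Whether the accuracy gain is worth that is something you would have to measure on your own data.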