Conceptually, it is two networks: one estimates the Q value, which we learn with, say, an L2 loss, and one estimates the policy, which we learn with the policy gradient. It is a hybrid approach that tries to get the best of both worlds.
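To make the two-network idea concrete, here is a minimal toy sketch, not the actual DDPG implementation: a linear critic fit with an L2 (least-squares) loss, and a linear deterministic actor `mu(s) = theta * s` trained by ascending the critic's estimate of Q(s, mu(s)), i.e. a deterministic policy gradient. The reward function and feature set are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step toy problem: state s and action a are scalars,
# reward r = -(a - 2*s)^2, so the optimal policy is a = 2*s.
def reward(s, a):
    return -(a - 2.0 * s) ** 2

# Quadratic features so the linear critic can represent the true Q exactly.
def features(s, a):
    return np.stack([a * a, a * s, s * s, a, s, np.ones_like(s)], axis=-1)

# Critic: minimize the L2 loss ||Phi w - r||^2 over random (s, a) samples.
s = rng.uniform(-1, 1, 2000)
a = rng.uniform(-3, 3, 2000)
w, *_ = np.linalg.lstsq(features(s, a), reward(s, a), rcond=None)

# Actor: deterministic policy mu(s) = theta * s, updated by gradient ascent
# on the critic's value Q(s, mu(s)).
theta = 0.0
for _ in range(200):
    s_batch = rng.uniform(-1, 1, 64)
    a_batch = theta * s_batch
    dq_da = 2 * w[0] * a_batch + w[1] * s_batch + w[3]  # dQ/da from the critic
    theta += 0.1 * np.mean(dq_da * s_batch)             # chain rule: da/dtheta = s

print(round(theta, 2))  # approaches 2.0, the optimal policy coefficient
```

In DDPG both parts are deep networks trained jointly from a replay buffer with target networks, but the division of labor is the same: the critic is a regression problem, and the actor follows the critic's gradient.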
I wrote an article on that a long time ago, but I have not released it yet. It may take a while: after I finish one more article, I will take a couple of months' break before coming back to RL.
I am not sure which code you are referring to, but in this code base, https://github.com/openai/baselines/tree/master/baselines/ddpg, they are two different instances.