Conceptually, it is two networks with one estimating the Q value so we use say L2-loss to learn it and one estimating the policy which use policy gradient to learn it. This is a hybrid approach that wants to take advantage of both worlds.

I have an article written long term ago on that but I have not released it yet. It may take a while since after I finish one more article, I will take a couple of months break before coming back to RL again.

I am not sure what “code” you are referring to. But for this code base,, it is two different instances.

Written by

Deep Learning

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store