The “vanilla” PG is mainly on-policy, i.e. whenever the policy is updated, we use the new policy to collect new samples and use those samples to compute the next policy gradient.

But PG can be made off-policy too by using importance sampling. Say we have a policy p1. We compute the policy gradient and use it to create p2. Next, we can still reuse the samples collected by p1 (through importance sampling) to compute p3. As shown below, two policies appear in the equation: the current policy being updated and the behavior policy that collected the samples.

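A common per-timestep form of this importance-sampled gradient (a standard sketch, with π_old playing the role of p1, π_θ the policy currently being updated, and Â_t an advantage estimate) is:

\nabla_\theta J(\theta) \;\approx\; \mathbb{E}_{(s_t, a_t) \sim \pi_{\text{old}}}\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t \right]

The ratio π_θ / π_old is the importance weight that corrects for the samples having been collected by the older policy.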

The catch is that you cannot keep reusing p1’s samples for too long: as the current policy drifts away from p1, the importance weights become noisy and the gradient estimate degrades. You have to recollect samples from the current policy periodically, say every 4 iterations.
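As a concrete illustration, here is a minimal PyTorch-style sketch of the corresponding surrogate loss. The function name and tensor arguments are assumptions for illustration, not any particular library’s API:

```python
import torch

def off_policy_pg_loss(new_log_probs: torch.Tensor,
                       old_log_probs: torch.Tensor,
                       advantages: torch.Tensor) -> torch.Tensor:
    """Importance-sampled policy gradient surrogate (minimal sketch).

    new_log_probs: log-probabilities of the sampled actions under the
                   policy currently being updated (e.g. p2)
    old_log_probs: log-probabilities recorded when the samples were
                   collected by the behavior policy (e.g. p1)
    advantages:    advantage estimates A_t for the sampled actions
    """
    # Importance weight p_current / p_old, computed in log space for stability.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    # Maximizing E[ratio * A] gives the importance-sampled policy gradient;
    # return the negative mean so a standard optimizer can minimize it.
    return -(ratio * advantages.detach()).mean()
```

Gradients flow only through new_log_probs, so differentiating ratio * advantages recovers the importance-weighted ∇ log π term in the equation above.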
