The “vanilla” policy gradient (PG) is on-policy: whenever the policy is updated, we collect new samples with the updated policy and use those samples to compute the next policy gradient.
But PG can be made off-policy with importance sampling. Say we have a policy p1. We compute the policy gradient and use it to produce p2. We can then reuse the samples collected under p1 (reweighted by importance sampling) to compute the gradient that produces p3. The resulting gradient involves two policies: the one that collected the samples and the one currently being updated.
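One hedged way to write this gradient (the notation here is assumed, not taken from the source): with behavior policy \(\pi_{\theta_1}\) (our p1, which collected the samples) and current policy \(\pi_\theta\),

$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_1}}\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_1}(a_t \mid s_t)}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\right]
$$

Both policies appear in the expression: \(\pi_{\theta_1}\) generated the data, while \(\pi_\theta\) is the one being optimized, and the ratio between them is the importance weight.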
The catch is that the importance-sampled estimate degrades as the two policies drift apart, so you cannot reuse p1's samples for too long. You have to recollect samples from the current policy periodically, say every 4 iterations.
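The loop above can be sketched in code. This is a minimal, illustrative example, not from the source: a 3-armed bandit with a softmax policy, where samples collected under an older "behavior" policy are reused for a few updates via importance weights, and fresh samples are recollected every 4 iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 2.0, 3.0])  # hypothetical mean reward per action

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_softmax(logits, a):
    # Gradient of log softmax(logits)[a] with respect to the logits.
    g = -softmax(logits)
    g[a] += 1.0
    return g

theta = np.zeros(3)
lr = 0.1
for it in range(40):
    if it % 4 == 0:
        # Recollect samples under the current policy; it becomes the
        # "behavior" policy (p1) for the next few updates.
        behavior = softmax(theta)
        actions = rng.choice(3, size=256, p=behavior)
        rewards = true_rewards[actions] + rng.normal(0, 0.1, size=256)
    pi = softmax(theta)
    # Importance weights: current policy probability / behavior probability.
    ratios = pi[actions] / behavior[actions]
    grad = np.mean(
        [w * r * grad_log_softmax(theta, a)
         for w, r, a in zip(ratios, rewards, actions)],
        axis=0,
    )
    theta += lr * grad
```

After training, the policy should concentrate on the highest-reward arm. In practice the importance weights are often clipped (as in PPO) to keep the reused-sample updates stable.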