OK. I think I am still sleepy but let’s give it another try. Thanks for your patience.

According to Sutton,

On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.

Recall that the distinguishing feature of on-policy methods is that they estimate the value of a policy while using it for control. In off-policy methods, these two functions are separated. The policy used to generate behavior, called the behavior policy, may in fact be unrelated to the policy that is evaluated and improved, called the target policy. An advantage of this separation is that the target policy may be deterministic (e.g., greedy), while the behavior policy can continue to sample all possible actions.

According to OpenAI,

PPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy.

My thoughts

In this context, is the current policy used as the behavior policy? Is this policy used for generating data and exploration. I agree with you that the answer is yes.

Deep Learning

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store