OK. I think I am still sleepy, but let’s give it another try. Thanks for your patience.
According to Sutton,
On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.
Recall that the distinguishing feature of on-policy methods is that they estimate the value of a policy while using it for control. In off-policy methods, these two functions are separated. The policy used to generate behavior, called the behavior policy, may in fact be unrelated to the policy that is evaluated and improved, called the target policy. An advantage of this separation is that the target policy may be deterministic (e.g., greedy), while the behavior policy can continue to sample all possible actions.
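The behavior/target separation Sutton describes can be made concrete with tabular Q-learning, the classic off-policy example. The sketch below is hypothetical (toy state and action counts, made-up transition); it shows an epsilon-greedy behavior policy that keeps sampling all actions, while the update bootstraps from the deterministic greedy target policy.

```python
import random

# Toy off-policy sketch: 5 hypothetical states, 2 actions.
# Behavior policy: epsilon-greedy (continues to sample all actions).
# Target policy: greedy (deterministic), the one being evaluated and improved.
N_STATES, N_ACTIONS = 5, 2
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def behavior_policy(state, epsilon=0.1):
    """Generates behavior: explores by occasionally taking a random action."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

def target_policy(state):
    """Greedy target policy: deterministic, never used to collect data here."""
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

def q_update(s, a, r, s_next, alpha=0.5, gamma=0.9):
    # Bootstraps from the value of the *target* (greedy) policy's action,
    # even though the transition (s, a, r, s_next) came from the behavior policy.
    best_next = Q[s_next][target_policy(s_next)]
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Hypothetical transition: action 1 in state 0 yields reward 1, returns to state 0.
a = behavior_policy(0)           # exploration comes from the behavior policy
q_update(0, 1, 1.0, 0)           # improvement targets the greedy policy
```

The point of the separation is visible in `q_update`: the data source and the policy being improved are different objects, which is exactly what on-policy methods collapse into one.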
According to OpenAI,
PPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy.
In this context, is the current policy used as the behavior policy? Is this the policy used for generating data and exploration? I agree with you that the answer to both is yes.
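That conclusion can be sketched in code. The snippet below is a hypothetical illustration, not OpenAI's implementation: a toy one-parameter stochastic policy where the current parameters both generate the rollout (behavior) and are the object of the next update (target), which is what makes PPO on-policy.

```python
import math
import random

random.seed(0)

# Toy stochastic policy over 2 actions, parameterized by a single logit theta.
theta = 0.0

def action_probs(theta):
    """Probability of each action under the current policy (sigmoid over logit)."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return [1.0 - p1, p1]

def sample_action(theta):
    """Explores by sampling from the latest version of the stochastic policy."""
    p = action_probs(theta)
    return 0 if random.random() < p[0] else 1

def collect_rollout(theta, n=4):
    # The data for the next PPO update is generated by the *current* theta:
    # at collection time, behavior policy and target policy are the same policy.
    return [sample_action(theta) for _ in range(n)]

batch = collect_rollout(theta)
# After the parameters are updated, this batch is stale: the new policy must
# generate fresh data, which is the defining property of an on-policy method.
```

The contrast with the off-policy setting is that here there is no second policy: exploration comes entirely from the stochasticity of the policy being trained.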