First, this is only mathematically the same if you have unlimited computational resources. In practice, the old policy needs to be refreshed fairly frequently.

Second, we only use samples collected under the old policy, reweighted by importance sampling, to estimate the reward we would get if the current policy were followed. The current policy keeps changing, and problems arise once it drifts far enough from the old policy that this estimate is no longer good enough. Right after a refresh, the current and old policies are still close, so the estimate is reasonably accurate.
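To make the point concrete, here is a minimal sketch on a toy problem with three actions. The rewards, the probabilities, and the helper name `is_mean_and_variance` are all made up for illustration. It computes the exact mean and variance of the one-sample importance-sampling estimator, so no random sampling is involved.

```python
def is_mean_and_variance(p_old, p_new, rewards):
    """Exact mean and variance of the one-sample importance-sampling
    estimator (p_new(a) / p_old(a)) * r(a), with a drawn from p_old."""
    mean = 0.0
    second_moment = 0.0
    for po, pn, r in zip(p_old, p_new, rewards):
        weighted = (pn / po) * r  # importance weight times reward
        mean += po * weighted
        second_moment += po * weighted ** 2
    return mean, second_moment - mean ** 2


# Hypothetical toy setup: one reward per action.
rewards = [1.0, 0.0, 2.0]
p_old = [0.5, 0.3, 0.2]      # policy that collected the data
p_close = [0.45, 0.3, 0.25]  # current policy shortly after a refresh
p_far = [0.05, 0.05, 0.9]    # current policy after drifting far away

mean_close, var_close = is_mean_and_variance(p_old, p_close, rewards)
mean_far, var_far = is_mean_and_variance(p_old, p_far, rewards)

print("close:", mean_close, var_close)
print("far:  ", mean_far, var_far)
```

In both cases the mean matches the true expected reward under the current policy, so the estimator stays unbiased. What changes is the variance, which is much larger when the current policy is far from the old one. With a finite number of samples, that is what makes the estimate unreliable.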

