First, the two formulations are only mathematically equivalent given unlimited computational resources. In practice, the policy needs to be refreshed fairly frequently.

Second, we use samples collected under the old policy, reweighted by importance sampling, to estimate the expected reward of the current policy, which keeps changing during training. Problems arise when this estimate is not accurate enough. By refreshing often, the current policy stays close to the old one, so the importance-sampling estimate remains reasonable.
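As a minimal sketch of the idea (with a made-up three-action bandit, not the paper's setup): we draw samples only from the old policy, then weight each sampled reward by the ratio of current-policy to old-policy probabilities. When the two policies are close, this reproduces the current policy's expected reward well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete action space with a fixed reward per action.
rewards = np.array([1.0, 2.0, 5.0])

# Old (behavior) policy and current policy as action distributions.
old_policy = np.array([0.5, 0.3, 0.2])
new_policy = np.array([0.4, 0.3, 0.3])

# Sample actions under the *old* policy only.
actions = rng.choice(3, size=100_000, p=old_policy)

# Importance-sampling estimate of the expected reward under the
# current policy: weight each sample by pi_new(a) / pi_old(a).
ratios = new_policy[actions] / old_policy[actions]
is_estimate = np.mean(ratios * rewards[actions])

# Exact expected reward under the current policy, for comparison.
true_value = np.sum(new_policy * rewards)
print(is_estimate, true_value)
```

If `new_policy` drifts far from `old_policy`, the ratios become large for rarely sampled actions and the estimate's variance blows up, which is exactly why the policy must be refreshed before the two diverge too much.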

Written by Deep Learning
