Nice question. You can have an infinite number of models. The key point is that if you keep improving your policy, it should converge. With a discrete action space (and a finite number of states), there are only finitely many combinations of actions, i.e. finitely many deterministic policies. So if every step is a strict improvement, no policy can be visited twice, and it should take a finite number of steps to get there (or at least to a local optimum). With a continuous action space, we can argue that once you reach a certain precision, further improvement doesn’t matter much. So in theory that should also take a finite number of steps.
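To make the finite-steps argument concrete, here is a minimal sketch of policy iteration on a toy problem. Everything in it is an assumption made up for illustration: the 4-state chain, the reward of 1 for landing on the last state, the discount `GAMMA = 0.9`, and the helper names `step`, `evaluate`, `q_value`, and `policy_iteration`. It only shows that the number of improvement rounds cannot exceed the number of deterministic policies.

```python
# Hypothetical toy MDP (not from the original discussion): a 4-state chain.
N_STATES = 4
ACTIONS = (-1, +1)   # move left or move right
GAMMA = 0.9          # assumed discount factor


def step(state, action):
    """Deterministic transition; reward 1.0 for landing on the last state."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def q_value(state, action, values):
    """One-step lookahead value of taking `action` in `state`."""
    nxt, reward = step(state, action)
    return reward + GAMMA * values[nxt]


def evaluate(policy, sweeps=500):
    """Approximate the value of a fixed policy by repeated backups."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        values = [q_value(s, policy[s], values) for s in range(N_STATES)]
    return values


def policy_iteration():
    """Improve the policy until it stops changing; return it and the round count."""
    policy = [-1] * N_STATES   # start with "always move left"
    rounds = 0
    while True:
        values = evaluate(policy)
        new_policy = []
        for s in range(N_STATES):
            best = max(ACTIONS, key=lambda a: q_value(s, a, values))
            # Keep the current action on ties so the loop cannot oscillate.
            if q_value(s, policy[s], values) >= q_value(s, best, values) - 1e-12:
                best = policy[s]
            new_policy.append(best)
        if new_policy == policy:
            return policy, rounds
        policy = new_policy
        rounds += 1


policy, rounds = policy_iteration()
total_policies = len(ACTIONS) ** N_STATES   # 2 actions in 4 states: 16 policies
print(policy, rounds, total_policies)
```

Because each round either changes the policy to a better one or stops, `rounds` is bounded by `total_policies`. In the continuous case the analogous move would be to stop once the improvement falls below a chosen tolerance.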

