Nice question. You can have an infinite number of models. But the key point is that if you keep improving your policy, it should converge. For a discrete action space (with a finite number of states), there are only finitely many deterministic policies. So if every step is a strict improvement, you can never revisit an earlier policy, and it should take a finite number of steps to get there (or at least to a local optimum). If you have a continuous action space, we can argue that once you reach a certain precision, you don’t care much about further improvement. So in theory it should also take a finite number of steps.
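To make the finite-steps argument concrete, here is a minimal sketch of policy iteration on a made-up toy problem. Everything in it is an assumption chosen for illustration, not something from the question: a 4-state chain with two actions (left/right), a reward of 1 for reaching the last state, a discount of 0.9, and the helper names `step`, `evaluate`, `improve`, and `policy_iteration`.

```python
from itertools import product

# Hypothetical toy MDP (an assumption for illustration): a 4-state chain.
# Action 0 moves left, action 1 moves right; the last state is an absorbing goal.
N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.9
GOAL = N_STATES - 1


def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    if state == GOAL:
        return state, 0.0  # absorbing goal, no further reward
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0)


def evaluate(policy, sweeps=500):
    """Iterative policy evaluation: repeatedly apply the Bellman backup."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES):
            nxt, r = step(s, policy[s])
            v[s] = r + GAMMA * v[nxt]
    return v


def improve(policy, v):
    """Greedy improvement: switch an action only if it is strictly better."""
    new = list(policy)
    for s in range(N_STATES):
        def q(a):
            nxt, r = step(s, a)
            return r + GAMMA * v[nxt]
        best = max(range(N_ACTIONS), key=q)
        if q(best) > q(policy[s]) + 1e-9:
            new[s] = best
    return tuple(new)


def policy_iteration(policy):
    """Alternate evaluation and improvement until the policy stops changing."""
    steps = 0
    while True:
        new_policy = improve(policy, evaluate(policy))
        if new_policy == policy:
            return policy, steps
        policy, steps = new_policy, steps + 1


# Number of deterministic policies: N_ACTIONS ** N_STATES, which is finite.
n_policies = len(list(product(range(N_ACTIONS), repeat=N_STATES)))
final_policy, n_steps = policy_iteration((0, 0, 0, 0))
print(final_policy, n_steps, n_policies)
```

Starting from the "always go left" policy, this stops after 3 improvement steps, well under the 16 possible deterministic policies. Because a policy only changes when it gets strictly better, it can never repeat, so the number of policies is an upper bound on the number of steps.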