RL — Value Fitting & Q-Learning

Value Fitting

  1. We sample an action from a state using an epsilon-greedy policy (SA).
  2. We observe the reward and the next state (SARS’).
  3. We use the Q-value function to find the action a’ with the maximum Q-value (SARSA max).
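The three steps above can be sketched as tabular Q-learning on a tiny chain environment (the environment, its size, and all hyperparameters below are illustrative, not from the original):

```python
import random

# Tabular Q-learning sketch on a hypothetical 5-state chain.
N_STATES, ACTIONS = 5, (0, 1)          # action 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # learning rate, discount, exploration

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Deterministic move; reward 1 only on reaching the last state."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

random.seed(0)
for _ in range(200):                   # episodes
    s = 0
    while s != N_STATES - 1:
        # 1. sample an action with epsilon-greedy (SA)
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            best = max(Q[(s, x)] for x in ACTIONS)
            a = random.choice([x for x in ACTIONS if Q[(s, x)] == best])
        # 2. observe the reward and the next state (SARS')
        s2, r = step(s, a)
        # 3. bootstrap from the action a' with the maximum Q-value
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, x)] for x in ACTIONS)
                              - Q[(s, a)])
        s = s2
```

Because step 3 maximizes over a’ instead of using the action the behavior policy actually takes next, this update can learn from off-policy samples.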
  • It is an off-policy method that can utilize off-policy samples. This improves sample efficiency (the number of samples needed to optimize a policy).
  • It has lower variance compared with policy gradient methods, whose gradient estimates usually have high variance.

Deep Q-learning (DQN) with a replay buffer and a target network

Fitting Q with a deep network introduces two problems:

  • Input samples within a trajectory are highly correlated. This is bad for supervised learning: it amplifies changes and noise. The replay buffer mitigates this by training on randomly sampled past transitions.
  • Our target value keeps changing, because the network we are fitting also produces the targets. The target network mitigates this by computing targets from a periodically refreshed frozen copy.
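A minimal sketch of the two stabilizers using a linear Q-approximator on a toy transition stream (the features, dynamics, and hyperparameters are all illustrative): random minibatches from the buffer break the correlation, and a periodically synced frozen copy keeps the regression target stable.

```python
import random
from collections import deque

random.seed(0)
N_F, ACTIONS = 3, (0, 1)
GAMMA, LR, BATCH, SYNC_EVERY = 0.9, 0.05, 16, 50

w = [[0.0] * N_F for _ in ACTIONS]   # online network (one weight row per action)
w_tgt = [row[:] for row in w]        # target network: frozen copy of w
buf = deque(maxlen=500)              # replay buffer

def q(weights, phi, a):
    return sum(c * x for c, x in zip(weights[a], phi))

def rand_state():
    return [random.uniform(-1, 1) for _ in range(N_F)]

s = rand_state()
for t in range(1, 2001):
    a = random.choice(ACTIONS)                 # exploratory behavior policy
    s2, r = rand_state(), s[0]                 # toy dynamics: reward = feature 0
    buf.append((s, a, r, s2))                  # store the transition for reuse
    s = s2
    if len(buf) >= BATCH:
        # a random minibatch decorrelates consecutive samples
        for ps, pa, pr, ps2 in random.sample(list(buf), BATCH):
            # the frozen target network keeps the regression target stable
            y = pr + GAMMA * max(q(w_tgt, ps2, b) for b in ACTIONS)
            err = y - q(w, ps, pa)
            for i in range(N_F):               # SGD step on the squared error
                w[pa][i] += LR * err * ps[i]
    if t % SYNC_EVERY == 0:                    # periodically refresh the target
        w_tgt = [row[:] for row in w]
```

Replaying stored transitions many times is also what lets DQN reuse off-policy samples and improve sample efficiency.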

Partially Observable MDP (POMDP)

Limitations and Tradeoffs

  • Value iteration, using the value function V, requires the model P.
  • This can be addressed by using the Q-value function instead.
  • Value iteration and Q-learning require a function estimator to scale the solution to continuous or large state spaces.
  • They reuse old sample data or results. This improves data efficiency.
  • Compared with PG or Monte Carlo methods, these methods have lower variance and less volatile gradient changes, so training can be more stable.
  • A non-linear function estimator does not have any convergence guarantee.
  • Value learning does not optimize the policy directly. Optimizing a policy is not necessarily the same as value learning, so inaccuracy may be introduced.
  • Finding the optimal action using the Q-value requires computing the maximum of Q over actions efficiently. This may not be easy for a continuous or large action space, and complex optimization techniques may be required to solve a* = arg max_a Q(s, a).
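For a discrete action set the maximization is a simple table scan, while a continuous action may require an inner optimization loop per action selection. A sketch of both cases, using hypothetical Q functions for one fixed state (both Q functions are illustrative stand-ins):

```python
# Discrete actions: arg max_a Q(s, a) is a table scan.
q_discrete = {"left": 0.1, "right": 0.8, "stay": 0.3}
best_discrete = max(q_discrete, key=q_discrete.get)   # -> 'right'

# Continuous action: maximize a hypothetical quadratic Q(s, a) in a
# by gradient ascent (an inner optimization per action selection).
def q_cont(a):
    return -(a - 0.7) ** 2           # illustrative: peak at a = 0.7

a, lr = 0.0, 0.1
for _ in range(200):
    grad = -2 * (a - 0.7)            # dQ/da
    a += lr * grad                   # ascend toward the maximizer
```

The inner loop must run every time an action is chosen, which is why a cheap-to-maximize Q (or a separately learned policy) matters for continuous action spaces.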

Value Learning vs. Policy Gradient


