RL — Reinforcement Learning Algorithms Overview

Jonathan Hui
8 min read · Dec 12, 2022

We have examined many Reinforcement Learning (RL) algorithms in this series, for instance, Policy Gradient methods for the MuJoCo tasks, DQN for Atari games, and Model-based RL for robotic control. While many algorithms are introduced alongside specific domains, such ties are often just legacy. In this article, we take an overview of these algorithms and discuss the general tradeoffs in choosing which method to use.

Model-free algorithms

RL algorithms can be divided into Model-based and Model-free. In Model-free RL, we don’t know and we don’t want to learn the system dynamics; or we simply don’t care, since the method needs no knowledge of the state transitions. We sample actions and observe the corresponding rewards to optimize a policy or to fit a value function.
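In the usual notation, the system dynamics are the state-transition distribution, which model-free methods never model explicitly:

$$p(s_{t+1} \mid s_t, a_t)$$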

Model-free RL is divided into policy optimization and value learning:

Here are some high-level descriptions of each category:

  • Value iteration — Value iteration uses dynamic programming to compute the value function iteratively.
  • Policy iteration — We compute the value function and optimize the policy in alternating steps.
  • Value-Learning/Q-Learning (e.g. DQN, Q-learning, TD, Fitted value iteration) — Without an explicit policy, we fit the value function or Q-value function iteratively from rewards observed under a behavior policy, such as an ε-greedy policy that usually selects the action with the highest Q-value but sometimes takes random actions for exploration (see the sketch after this list).
  • Policy gradients (e.g. REINFORCE, TRPO, PPO) — We optimize the policy using gradient ascent on the expected reward w.r.t. the policy parameters θ. Intuitively, it gradually improves the policy to make high-reward actions in a given state more likely.
  • Actor-Critic (e.g. A3C, SAC) — The gradient in the Policy Gradient method can be formulated with an actor that decides which action to take and a critic that estimates the value or Q-function. The Actor-Critic method is mainly a Policy Gradient method with the advantage function computed from the observed rewards and a critic network. In this method, we fit the critic (value function or Q-function) and the actor iteratively.
  • Evolution — We start with a random policy. We take action and observe the rewards. We make many guesses (actions) and keep those with higher rewards. By repeating this process, we allow the more promising solutions to evolve and mature.
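To make the value-learning bullet concrete, here is a minimal sketch of tabular Q-learning with an ε-greedy behavior policy. The environment interface (`env.reset()`, `env.step()` returning the next state, the reward, and a done flag) is an assumption for illustration, not part of the original article.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy (sketch)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: mostly exploit the current Q-values, sometimes explore
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy TD target: bootstrap with the max over next actions
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```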

Model-based RL

For Model-based RL, we start with the system dynamics (the model) and the cost function to make optimal decisions.

  • Trajectory optimization (e.g. iLQR) — Plan an optimal controller for optimal actions (an optimal policy) using the model and the cost function.
  • Monte Carlo Tree Search (MCTS) — MCTS is trajectory optimization in a discrete control space. MCTS searches over possible moves and records the results in a search tree. It uses a Q-value function to exploit promising actions and adds exploration bonuses for better exploration (the UCT score shown after this list is a common choice). Eventually, it builds a local policy for determining the next move.
  • Backpropagation (e.g. PILCO) — If the cost function and the model are known, we can use backpropagation through the cost function and the model to improve a policy.
  • Imitating optimal control (e.g. Guided Policy Search) — We train an optimal controller with concepts similar to trajectory optimization. But in parallel, we use the sampled trajectories produced by the optimal controller to create a policy 𝜋.
  • Planning (e.g. Dyna) — Dyna uses real experience to fit a value function (step d) and records the state transition and reward in a “Model” (step e). We can then replay experience from the “Model” randomly (step f) to train the value function further.
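For the MCTS bullet above, the exploration bonus typically takes the UCT form, trading off the estimated Q-value against how rarely an action has been tried:

$$a^* = \arg\max_a \Big( Q(s, a) + c\,\sqrt{\tfrac{\ln N(s)}{N(s, a)}} \Big)$$

where N(s) is the visit count of the node, N(s, a) counts how often action a was taken from it, and c controls the amount of exploration.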

Objective

Both model-free and model-based RL optimize actions to maximize the expected rewards.
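Written out, the objective is the expected return over trajectories induced by the policy (a standard formulation, with r the reward of a state-action pair):

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[ \sum_t r(s_t, a_t) \Big]$$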

Value-learning

Value learning uses supervised learning to fit the Q-value or V-value function from sampled rewards.
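As a minimal sketch of this “supervised fit” view, here is one DQN-style regression step, assuming a replay-buffer batch and a frozen target network (the names `q_net`, `target_net`, and the batch layout are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def fitted_q_step(q_net, target_net, optimizer, batch, gamma=0.99):
    """One regression step: fit Q(s, a) to the bootstrapped TD target."""
    s, a, r, s_next, done = batch  # tensors sampled from a replay buffer
    with torch.no_grad():
        # bootstrapped target, computed with a frozen target network for stability
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, target)  # supervised regression loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```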

Limitations:

  • It minimizes a fitting error rather than optimizing the final policy directly.
  • No guarantee of convergence when a non-linear approximator is used for the value function.
  • Stability problems when the value function is approximated instead of computed exactly. These can be mitigated with schemes like those in DQN (experience replay and a target network).
  • Many hyperparameters: target network update delay, replay buffer size, and sensitivity to the learning rate.

Policy Gradient

Policy gradient maximizes the likelihood of actions that receive high rewards.

To stabilize training, we subtract a baseline from the expected rewards to reduce the variance.

One common choice for the baseline is the state value V(s). Here are the definitions:
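With V(s) as the baseline, the standard definitions and the resulting gradient are:

$$Q(s_t, a_t) = \mathbb{E}\Big[\sum_{t' \ge t} r(s_{t'}, a_{t'}) \,\Big|\, s_t, a_t\Big], \qquad V(s_t) = \mathbb{E}_{a_t \sim \pi_\theta}\big[Q(s_t, a_t)\big]$$

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\big]$$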

Strength:

  • Optimizes the policy directly.
  • A gradient-based solution that works well with deep learning models (including RNNs).
  • Constraints and incentives can be added to the objective function easily.

Limitations

  • High variance. Policy gradients are noisy — to mitigate the problem, use a larger batch size or a trust region.
  • Hard to tune the learning rate — use Adam or natural-policy-gradient-style methods like TRPO and PPO.
  • Poor sample efficiency — it needs a lot of samples.
  • The baseline design for the expected rewards can be complex.
  • It does not define the problem exactly as an optimization problem.

Monte-Carlo Policy Gradient

Here is the algorithm for Policy Gradient which uses a full episode rollout (Monte Carlo method) to evaluate the total rewards.
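A minimal sketch of this Monte-Carlo variant (REINFORCE), assuming a PyTorch policy network that maps a state to a categorical action distribution; the episode is a list of (state, action, reward) tuples collected with the current policy:

```python
import torch

def reinforce_update(policy, optimizer, episode, gamma=0.99):
    """REINFORCE: one gradient step computed from a full episode rollout."""
    returns, G = [], 0.0
    for _, _, r in reversed(episode):   # discounted return-to-go for each step
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # crude variance reduction

    loss = 0.0
    for (s, a, _), G_t in zip(episode, returns):
        log_prob = policy(s).log_prob(a)   # log pi_theta(a | s)
        loss = loss - log_prob * G_t       # ascend E[log pi * return]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```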

TRPO

Policy Gradient methods may make aggressive policy changes that destroy the training progress. TRPO is a Policy Gradient method that uses the natural policy gradient. It adds a trust region to limit the policy change in each iteration, for safer and better updates.
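The update can be written as a constrained optimization: maximize a surrogate objective subject to a KL-divergence trust region of size δ:

$$\max_\theta \; \mathbb{E}\Big[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,A_t\Big] \quad \text{s.t.} \quad \mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\big] \le \delta$$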


Limitations:

  • The conjugate gradient step in TRPO is complex and computationally expensive.
  • The computation of H, which measures the curvature of the policy w.r.t. the policy parameters θ, does not scale well to expressive policy models like CNNs and RNNs. TRPO performs badly on the Atari benchmark.

PPO

PPO is similar to TRPO. It slightly relaxes the limit on the policy change so that it can use a faster first-order optimizer instead of computing second-order derivatives.
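Concretely, the clipped variant of PPO optimizes the surrogate objective

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,A_t,\; \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

where ε (e.g. 0.2) plays the role of the trust-region size.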


Actor-critic

The Actor-critic method is a Policy Gradient method that adds value learning (a critic) to reduce the variance of the reward estimate. This change stabilizes training.
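A common choice is to let the critic learn V(s) and estimate the advantage with the one-step TD error:

$$A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t)$$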

Model-based RL

Model-based RL starts with the system dynamics (the model) and the cost function to find optimal controls. Control has the same meaning as action here. Once an optimal controller is trained, we can use it alone to take actions. But sometimes, the policy for a task can be simpler than the model. If that is true, we can use the model and further planning to create samples to train a policy (Guided Policy Search). We can also use the trained model to generate samples to learn a value function (Dyna).

Model-based RL (No explicit policy)

We fit the model with sampled trajectories and use trajectory optimization to plan a controller that makes optimal controls (a simple planning sketch follows the list below).

In planning:

  • For continuous spaces, use trajectory optimization.
  • For discrete spaces, use methods like the Monte Carlo tree search.
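For the continuous case, one of the simplest planners is random shooting: sample candidate action sequences, roll each one out through the learned model, and execute the first action of the cheapest sequence. The `model(s, a)` and `cost_fn(s, a)` callables below are illustrative assumptions:

```python
import numpy as np

def random_shooting_plan(model, cost_fn, s0, horizon=15,
                         n_candidates=1000, action_dim=2):
    """Pick the first action of the lowest-cost sampled action sequence (MPC style)."""
    best_cost, best_action = np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total_cost = s0, 0.0
        for a in actions:              # roll the candidate sequence out through the model
            total_cost += cost_fn(s, a)
            s = model(s, a)
        if total_cost < best_cost:
            best_cost, best_action = total_cost, actions[0]
    return best_action                 # re-plan from the new state at every step
```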

Limitations

  • The focus is on building a better model, but a better model is not the same as a better policy.

Model-based RL (Policy learning through backpropagation)

Weakness

  • States and actions inside a trajectory are highly correlated, so the computed gradients tend to compound. Training can suffer badly from vanishing and exploding gradients (see the chain rule below).
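The reason is visible in the chain rule: backpropagating from a late state s_T to an early action a_1 multiplies a long chain of Jacobians through the model, so the gradient shrinks or blows up just as in a long RNN:

$$\frac{\partial s_T}{\partial a_1} = \Big(\prod_{t=2}^{T-1} \frac{\partial s_{t+1}}{\partial s_t}\Big)\frac{\partial s_2}{\partial a_1}$$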

Guided Policy Search (GPS)

GPS uses trajectory optimization to optimize a controller. Then, we run the controller to generate trajectories so we can learn a policy using supervised learning.
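GPS constrains the controller and the policy to stay consistent with each other, but the core of the supervised step is plain regression onto the controller’s actions. A minimal sketch, with `policy`, `states`, and `controller_actions` as illustrative names:

```python
import torch.nn.functional as F

def fit_policy_to_controller(policy, optimizer, states, controller_actions, epochs=10):
    """Supervised step: regress the policy's actions onto the controller's actions."""
    for _ in range(epochs):
        pred = policy(states)                        # policy's action for each sampled state
        loss = F.mse_loss(pred, controller_actions)  # plain supervised regression
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```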


Strength

  • In some problem domains, the policy is simpler to learn than the model.

Dyna

We train a model using real experience. Later, we use the trained model to generate samples to learn the Q-value function better.

Model-based RL using sample-based planning (source)
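A minimal Dyna-Q sketch following steps d to f above: update Q from the real transition, record the transition in the model, then replay n simulated transitions drawn from the model (a tabular setting is assumed for illustration):

```python
import random
import numpy as np

def dyna_q_step(Q, model, s, a, r, s_next, alpha=0.1, gamma=0.99, n_planning=10):
    """One Dyna-Q iteration: direct RL update, model learning, then planning."""
    # (d) direct RL: Q-learning update from the real transition
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    # (e) model learning: remember what this (state, action) pair led to
    model[(s, a)] = (r, s_next)
    # (f) planning: replay previously seen transitions sampled from the model
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps_next]) - Q[ps, pa])
    return Q, model
```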

Strength

  • Improves sample efficiency: sampled trajectories are used, directly or indirectly, multiple times to improve value learning.
