Photo by Daniel Cheung

RL — Reinforcement Learning Algorithms Quick Overview

This article gives a quick overview of the major algorithms in reinforcement learning. Each algorithm is described briefly in one place for an easy and quick reference.

Terms

The action-value function Q, the state-value function V, and the advantage function A are used throughout this article.

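In standard notation, with discount factor γ and trajectories generated by the policy π, they are defined as:

Q^π(s_t, a_t) = E_π[ Σ_{t' ≥ t} γ^(t'-t) r(s_t', a_t') | s_t, a_t ]   (expected return after taking a_t in s_t)
V^π(s_t) = E_{a_t ~ π(·|s_t)}[ Q^π(s_t, a_t) ]                        (expected return from s_t)
A^π(s_t, a_t) = Q^π(s_t, a_t) - V^π(s_t)                              (how much better a_t is than the policy's average action)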

Policy Evaluation

Policy evaluation computes the value functions for a policy π using the Bellman equations.

For example, the state-value function V of a policy can be computed by sweeping over all states and applying the Bellman expectation backup until the values converge.

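In the standard tabular form, with known transition probabilities p(s'|s, a), the backup for a stochastic policy π is:

V_{k+1}(s) = Σ_a π(a|s) [ r(s, a) + γ Σ_{s'} p(s'|s, a) V_k(s') ]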

Policy evaluation using Monte Carlo Rollouts

Monte Carlo policy evaluation plays out whole episodes to the end and uses the observed total rewards to estimate the value function.

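As a sketch, the Monte Carlo estimate of a state's value is simply the average of the returns observed after visiting that state:

V^π(s) ≈ (1/N) Σ_{i=1}^{N} G_i,   where G_i is the total discounted reward collected after the i-th visit to s.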

Policy Iteration

Policy iteration evaluates the current policy (step 1) and then improves the policy greedily with respect to the resulting value function (step 2), repeating the two steps continuously.
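In standard notation, one iteration looks like this:

Step 1 (evaluation):  compute V^π for the current policy π (for example, with the backup above)
Step 2 (improvement): π'(s) = argmax_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) V^π(s') ]  for every state s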

Example: The following maze exits on the top-left or the bottom-right corner, with a reward of -1 for every step taken. We evaluate and improve the policy continuously. Since the numbers of states and actions are finite, the policy eventually converges to the optimal policy π*.


Value Iteration

Value iteration computes the value function iteratively with the Bellman optimality equation, using dynamic programming.

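In the standard tabular form, the update is:

V_{k+1}(s) = max_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) V_k(s') ]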

Example: This maze exits on the top-left corner with a reward of -1 for every step taken.



This method ignores the policy during the iterations and focuses only on the value function. The final policy is then derived from the converged value function.

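In standard notation, the derived policy is the greedy policy:

π(s) = argmax_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) V(s') ]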

Fitted value iteration

Fitted value iteration estimates the value function V with a function approximator parameterized by Φ.

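A sketch of the standard recipe: for each sampled state s_i, compute a bootstrapped target and regress V_Φ onto it.

y_i = max_a [ r(s_i, a) + γ E[ V_Φ(s'_i) ] ]
Φ ← argmin_Φ (1/2) Σ_i || V_Φ(s_i) - y_i ||²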

Fitted Q iteration

Fitted Q iteration is built on the Bellman equation for the action-value function.

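In the standard optimality form:

Q(s, a) = r(s, a) + γ E_{s'}[ max_{a'} Q(s', a') ]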

Estimate the action-value function Q with a model parameterized by Φ.

The algorithm comes in a full (batch) version and an online version.
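A sketch of the standard recipe: collect transitions (s_i, a_i, r_i, s'_i), compute bootstrapped targets, and regress Q_Φ onto them.

y_i = r_i + γ max_{a'} Q_Φ(s'_i, a')
Φ ← argmin_Φ (1/2) Σ_i || Q_Φ(s_i, a_i) - y_i ||²

The online version processes one transition at a time and takes a single gradient step per transition.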

Q-learning (SARSA max)


Q-learning learns the action-value function online while following an exploration policy (such as ε-greedy).


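A minimal tabular sketch in Python. It assumes a hypothetical Gym-style environment `env` with discrete states and actions, where env.reset() returns a state index and env.step(a) returns (next_state, reward, done, info); the function and parameter names are illustrative only.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration policy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q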

Q-learning with a replay buffer and a target network

Add a replay buffer and a target network for training stability.

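In the standard formulation, with a target network Q_{Φ'} that is a periodically synchronized copy of the current network Q_Φ, the regression target for a transition (s, a, r, s') sampled from the replay buffer is:

y = r + γ max_{a'} Q_{Φ'}(s', a')

Φ is updated toward y while Φ' is held fixed between synchronizations, which keeps the targets from chasing a moving network.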

DQN (Deep Q-learning)

DQN is Q-learning with a deep network that estimates Q, trained with a replay buffer and a target network.


Policy Gradients

Policy gradient methods maximize the expected reward by making actions that lead to high rewards more likely.

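In standard notation, the gradient estimated from N sampled trajectories is:

∇_θ J(θ) ≈ (1/N) Σ_i Σ_t ∇_θ log π_θ(a_{i,t} | s_{i,t}) · G_{i,t}

where G_{i,t} is the return (reward-to-go) collected from step t of trajectory i, and the policy parameters are updated by gradient ascent: θ ← θ + α ∇_θ J(θ).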

Example: policy gradient using Monte Carlo rollouts and a baseline.

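In the standard baseline form, a baseline b (for example, the average return, or a learned state value V(s_{i,t})) is subtracted from the return; this reduces variance without biasing the gradient:

∇_θ J(θ) ≈ (1/N) Σ_i Σ_t ∇_θ log π_θ(a_{i,t} | s_{i,t}) · (G_{i,t} - b)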

Actor-Critic Algorithm

Actor-critic methods combine policy gradient with value-function learning: a learned critic (value function) replaces the raw Monte Carlo return in evaluating the actor's actions.

The algorithm comes in a batch version and an online version.
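A sketch of the online variant in standard notation: after each transition (s, a, r, s'), the critic V_Φ is regressed toward the bootstrapped target r + γ V_Φ(s'), the advantage is estimated as

A(s, a) ≈ r + γ V_Φ(s') - V_Φ(s)

and the actor takes a gradient step along ∇_θ log π_θ(a | s) · A(s, a).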

A3C: asynchronous advantage actor-critic, which runs multiple actor-learners in parallel and uses the advantage function in the policy gradient.


Deep Deterministic Policy Gradient DDPG (Continuous actions)

An actor-critic approach for continuous action spaces, with a deterministic policy as the actor and a Q-function as the critic.

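In standard notation, DDPG learns a deterministic actor μ_θ and a critic Q_Φ. The critic is trained with Q-learning-style targets (using a replay buffer and target networks), and the actor follows the deterministic policy gradient:

∇_θ J ≈ E_s[ ∇_a Q_Φ(s, a) |_{a = μ_θ(s)} · ∇_θ μ_θ(s) ]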

Parameter noise can be added to the policy weights for exploration.


Natural Policy Gradient

Natural policy gradient is an improved policy gradient method designed to guarantee policy improvement. Its update is invariant to how the policy model is created and parameterized.

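In standard notation, the natural gradient preconditions the ordinary policy gradient with the inverse Fisher information matrix F of the policy distribution:

θ ← θ + α F⁻¹ ∇_θ J(θ),   with F = E_{π_θ}[ ∇_θ log π_θ(a|s) ∇_θ log π_θ(a|s)ᵀ ]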

TRPO

TRPO optimizes the natural policy gradient step using conjugate gradient, which avoids forming and inverting the Fisher matrix explicitly.

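In its standard form, TRPO solves a constrained surrogate problem at each update:

maximize_θ   E[ (π_θ(a|s) / π_{θ_old}(a|s)) · A(s, a) ]
subject to   E[ KL( π_{θ_old}(·|s) || π_θ(·|s) ) ] ≤ δ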

A backtracking line search then shrinks the proposed step until the KL constraint is satisfied and the surrogate objective actually improves.

PPO

PPO approximates the trust-region constraint by clipping the probability ratio between the new and the old policy in the surrogate objective.

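In standard notation, with the probability ratio r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t), the clipped surrogate objective is:

L_CLIP(θ) = E_t[ min( r_t(θ) A_t,  clip(r_t(θ), 1 - ε, 1 + ε) A_t ) ]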

DAgger: Dataset Aggregation

DAgger improves imitation learning by running the learned policy, asking a human to label the visited states with the correct actions, and aggregating these labeled samples into the training set.


Monte Carlo Tree Search (MCTS)

MCTS searches discrete action spaces by growing a search tree with an exploration-aware tree policy, expanding promising nodes, rolling out to estimate their values, and backing the results up the tree.


Model-based Reinforcement Learning

MPC (Model Predictive Control)

MPC plans a sequence of actions with the model, executes only the first action, and then replans after every step.


Policy search with backpropagation

With a learned, differentiable dynamics model, a policy can be trained by backpropagating the rewards through the model and the policy.


PILCO

PILCO performs policy search with backpropagation through a learned probabilistic dynamics model.


Policy search with Gaussian Process (GP)

A Gaussian process is used as the dynamics model, which also provides uncertainty estimates for its predictions.


Guided Policy Search (GPS)

General scheme

Use locally optimized trajectories to guide the training of a policy.


Deterministic GPS

Guided policy search with a deterministic policy.


Stochastic GPS

Guided policy search with a stochastic policy.


Imitating optimal control

Use computer-generated samples (s, a) from locally optimized trajectories to train a policy.

PLATO algorithm

Data is collected with a linear-Gaussian controller, and the controller replans at every step, as in MPC, while the policy is trained on the collected (s, a) samples.

Dyna-Q

Dyna-Q is a continual loop of learning the value function or the policy from both real and simulated experience. The real experience is used to build and refine the dynamics model, and the model is then used to produce simulated experience that complements the training of the value function and/or the policy.


The value function is also learned from sampled (simulated) experience generated by the model.


Cross-entropy method (CEM)

Take "educated" random guesses at action sequences, keep the top-performing guesses, and use them to fit the sampling distribution for the next round of guessing.

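A minimal Python sketch of CEM with a Gaussian sampling distribution; `score` is a hypothetical user-supplied function that evaluates a batch of candidate action vectors (for example, by rolling them out under a model), and all names and sizes are illustrative only.

import numpy as np

def cem(score, dim, iters=10, pop=64, elite_frac=0.2):
    # Gaussian sampling distribution over action vectors of length `dim`
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        # "educated" random guesses drawn from the current distribution
        samples = mean + std * np.random.randn(pop, dim)
        rewards = score(samples)                          # evaluate every guess
        elite = samples[np.argsort(rewards)[-n_elite:]]   # keep the top performers
        # refit the distribution to the elites; they seed the next round
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean  # the final mean is the chosen action vector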

Dual Gradient Descent (DGD)

Optimize an objective under constraints by working with the Lagrangian: alternately optimize the original variables with the Lagrange multiplier fixed, then take a gradient step on the multiplier.

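A sketch of the standard loop for a problem of the form minimize f(x) subject to g(x) = 0, with Lagrangian L(x, λ) = f(x) + λ g(x):

1. x ← argmin_x L(x, λ)   (optimize the primal variables with λ fixed)
2. λ ← λ + α g(x)         (gradient step on the dual variable)

Repeat until the constraint g(x) = 0 is (approximately) satisfied.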

ε-greedy policy

To sample the k-th episode, an ε-greedy policy is used to pick which action to take next: with probability ε a random action is chosen, otherwise the action with the highest estimated Q value.

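In its standard form, for an action set with |A| actions:

π(a|s) = 1 - ε + ε/|A|   if a = argmax_{a'} Q(s, a')
π(a|s) = ε/|A|           for every other action

(ε is often decayed as the episode count k grows, so the policy becomes greedy in the limit.)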

Generalized advantage estimation (GAE)

An n-step look-ahead advantage function uses the actual rewards for the first n steps and a value estimate afterwards.

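In standard notation, with a value estimate V:

A^(n)(s_t, a_t) = Σ_{l=0}^{n-1} γ^l r(s_{t+l}, a_{t+l}) + γ^n V(s_{t+n}) - V(s_t)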

GAE combines the advantage estimators from 1-step up to k-step lookahead, weighted exponentially.


The final advantage function for GAE combines all of these estimators.

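In standard form, with the one-step TD error δ_t = r_t + γ V(s_{t+1}) - V(s_t), the GAE estimator is an exponentially weighted sum of these errors:

A_t^{GAE(γ, λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}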

where λ is a hyperparameter between 0 and 1. When λ is 1, the estimate is a Monte Carlo estimate (with the value function as a baseline); when λ is 0, it is the one-step TD estimate.

Credits and references

UC Berkeley Reinforcement Learning

UC Berkeley Reinforcement Learning Bootcamp

UCL Reinforcement Learning

Reinforcement Learning: An Introduction
