RL — The Math behind TRPO & PPO

8 min readSep 13, 2018

TRPO Trust Region Policy Optimization & Proximal Policy Optimization PPO are based on the Minorize-Maximization MM algorithm. In this article, we cover the basic MM algorithm and go through the steps on how the objective function for TRPO & PPO is derived. In our Reinforcement Learning series, we will cover each topic separately. But in this article, we will detail the math a little more for those curious readers that want to understand the reasoning behind those objective functions.

Surrogate function

RL is about maximizing the expected discounted rewards. The red curve below represents the expected discounted rewards η which is defined as:

MM is an iterative method. For each iteration, we find a surrogate function M (the blue curve) that is

a lower bound function for η,
approximate η at the current policy, and
easy to optimize (we will approximate the surrogate function as a quadratic equation).

In each iteration, we find the optimal point for M and use that as the current policy.

Then, we re-evaluate a lower bound at the new policy and repeat the iteration. As we continue the process, the policy keeps improving. Since there is only finite possible policy, our current policy will eventually converge to the local or the global optimal.

Objective function

As a preview, the natural policy gradient, TRPO, and PPO starts with this objective function. We will go through the proof in more details next.

In short, we want to maximize the advantage function: the expected rewards minus a baseline to reduce variance. But we apply a constraint that the new policy cannot be too different from the old policy. The remaining of this article will justify this constraint mathematically.

Notation for the value, action-value and advantage functions:

First, let’s define the Q-value function, the value function, and the advantage function first. That should be pretty straight forward.

Expected discounted reward

The expected discounted reward η is calculated as:

Alternatively, we can compute the reward of a policy using another policy. This lays down the foundation of comparing two policies.

Proof:

Function 𝓛

Next, to use the MM algorithm, we want to find the lower bound function approximating η locally at the current policy. Let’s define function 𝓛 as:

As a preview, 𝓛 is part of the lower bound function M (underlined as red below).

The second term in M is the KL-divergence

At the current policy, it equals to KL(θi, θi) which is 0. The term C times KL can be view as the upper bound error for L.

At the current policy θi, we can show that L matches η to the first order.

As KL(θi, θi)=0, M approximates the expected rewards locally: a requirement for the MM algorithm. Next we will show the details of the lower bound function M. A two-page proof in the appendix A of the TRPO paper establishes an lower bound for η.

D_TV is the total variation divergence. But it is not very importance, since we will replace it soon with the KL-divergence because

We rewrite the lower bound function as:

Note: we also simplify the notation with:

Monotonically improving guarantee

The key concept of the Natural Policy Gradient is the Monotonically improving guarantee. It is the “money guarantee” version in the Policy Gradient methods. In short, at least in theory, any policy updates will be guaranteed better than the old one. What we really prove here is the new policy generated from optimizing M will have a guarantee that it will perform better in η (the real expected rewards) than the old policy. Since there are only finite policies, the continuous improvement will only lead us to a local or a global optimal point. Here is the proof:

Policy iteration with Monotonically Improving Guarantee

Here is the iteration algorithm that guarantees that the new policy will always perform better than the current one.

However, finding the maximum of KL divergence (among all policies) is intractable. We will relax the requirement and use the mean of the KL divergence instead.

Recall 𝓛 as:

We can use importance sampling to estimate the L.H.S. by sampling from policy q:

Therefore, the objective can be rewritten as:

With Lagrangian duality, a constraint for an objective function can be integrated back to the objective function with a multiplier. Both are mathematically the same:

Intuition

In Gradient ascent, we determine the steepest direction and then step forward in that direction. However, if the learning rate is too high, the action may deviate from the real reward function too much that, it turns disaster.

In the trust region, we limit our search within a region controlled by δ. Mathematically, we prove in the previous section that such region guarantees that its optimal policy will outperform the current one until it reaches a local or global optimal.

As we continue the iteration, we can reach the optimal point.

Optimizing the objective function

As we mentioned before, the lower bound function M should be easy to optimize. In fact, we approximate L and the mean of the KL-divergence using Taylor Series up to the first order for L and second order for the KL-divergence:

where g is the policy gradient and H is called the Fisher Information matrix FIM in the form of a Hessian matrix.

The optimization problem becomes:

We can solve it by optimizing a quadratic equation and the solution becomes:

The step calculated above is called the natural policy gradient. Here is the algorithm in optimizing policy using MM algorithm and the Natural Policy Gradient.

TRPO

However, FIM (H) and its inverse are very expensive to compute. TRPO estimates the following term

by solving x for the following linear equation:

It turns out we can solve it with the conjugate gradient method which is computational less complex than finding the inverse of H. Conjugate gradient is similar to gradient descent but it can find the optimal point in at most N iterations where N is the number of parameters in the model.

The resulting algorithm is:

PPO with Clipped Objective

In PPO with clipped objective, we want to use a first-order optimizer like the gradient descent to optimize our problem with a soft constraint to penalize the objective if we change the policy too large.

In implementation, we maintain two policy networks. The first one is the current policy that we want to refine.

The second is the policy that we last used to collect samples.

With the idea of importance sampling, we can evaluate the new policy with samples collected from an older policy. This improves sample efficiency.

But as we refine the current policy, the difference between the current and the old policy is getting larger. The variance of the estimation will increase. So, say for every 4 iterations, we synchronize the second network with the current policy again.

In PPO, we compute a ratio between the new policy and the old policy:

And we construct a new objective function to clip the estimated advantage function if the new policy is far away from the old policy. The new objective function is:

If the probability ratio between the new policy and the old policy falls outside the range (1- ε) and (1 + ε), the advantage function will be clipped. ε is set to 0.2 for the experiments in the PPO paper.