RL — Trust Region Policy Optimization (TRPO) Part 2


After covering the three basic concepts behind TRPO in Part 1 (the MM algorithm, trust region, and importance sampling), we will explain the method in detail in this final part.

Optimization problem

In this section, we define the objective we want to optimize. We start with the math, which is harder to follow, and then explain the intuition, which is fairly simple.

What do we optimize? Optimizing the policy π’ can be reformulated as:

$$\max_{\pi'} \; J(\pi') \;=\; \max_{\pi'} \; \big( J(\pi') - J(\pi) \big)$$

We change the notation to match the TRPO & PPO papers. The reward function notation has changed from η to J. π’ is the variable we want to optimize and π is the old policy. J(π), the expected reward for the old policy π, is a constant and therefore the change above does not alter the solution. In short, maximizing the reward for a policy is the same as maximizing the reward relative to a specific policy.

Function 𝓛

Now, we need to find the lower bound function for the MM algorithm. Its first term is the function 𝓛, defined as:

$$\mathcal{L}_{\pi}(\pi') \;=\; J(\pi) \;+\; \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi}\!\left[ \frac{\pi'(a|s)}{\pi(a|s)}\, A^{\pi}(s, a) \right]$$

where

$$d^{\pi}(s) \;=\; \sum_{t=0}^{\infty} \gamma^{t}\, P(s_t = s \mid \pi), \qquad A^{\pi}(s, a) \;=\; Q^{\pi}(s, a) - V^{\pi}(s)$$

Don’t get scared by 𝓛. d is the discounted future state distribution; if γ=1, d is simply the state visitation frequency under the policy π. A is the advantage function (the expected reward relative to a baseline). In short, we can view 𝓛 as using importance sampling to estimate the expected advantage.
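As a quick illustration of that importance-sampling view, here is a tiny NumPy sketch. The log-probabilities and advantages below are made-up numbers, not taken from a real policy:

```python
import numpy as np

# Hypothetical log-probabilities of the sampled actions under the old policy pi
# and a candidate policy pi', plus their estimated advantages.
log_prob_old = np.array([-1.2, -0.7, -2.1])
log_prob_new = np.array([-1.0, -0.9, -1.8])
advantages   = np.array([ 0.5, -0.3,  1.2])

ratio = np.exp(log_prob_new - log_prob_old)   # importance weights pi'(a|s) / pi(a|s)
surrogate = np.mean(ratio * advantages)       # sample estimate of the expectation in L
print(surrogate)
```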

Appendix A of the TRPO paper provides a two-page proof that establishes the following bound:

$$J(\pi') \;\ge\; \mathcal{L}_{\pi}(\pi') - C\, D_{KL}^{\max}(\pi, \pi'), \qquad C = \frac{4\epsilon\gamma}{(1-\gamma)^{2}}, \;\; \epsilon = \max_{s,a}\lvert A^{\pi}(s, a)\rvert$$

where D_KL is the KL-divergence, which measures the difference between two distributions p and q. KL-divergence is defined as:

$$D_{KL}(p \,\|\, q) \;=\; \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
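As a small sanity check of this definition, here is a minimal NumPy sketch for two discrete distributions (handling of zero probabilities is omitted for brevity):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.6, 0.3, 0.1])    # e.g. the old policy's action probabilities in one state
q = np.array([0.5, 0.4, 0.1])    # e.g. the new policy's action probabilities
print(kl_divergence(p, p))       # 0.0 -- identical distributions
print(kl_divergence(p, q))       # a small positive number
```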

With some tweaking, this is our final lower bound M,

$$M(\pi') \;=\; \mathcal{L}_{\pi}(\pi') - C\, D_{KL}^{\max}(\pi, \pi')$$

which lower-bounds our objective function:

$$J(\pi') \;\ge\; M(\pi')$$

To summarize, we will use the MM algorithm to maximize:

$$\max_{\pi'} \; M(\pi') \;=\; \max_{\pi'} \; \Big[ \mathcal{L}_{\pi}(\pi') - C\, D_{KL}^{\max}(\pi, \pi') \Big]$$

The inequality below is important because it establishes an upper bound on the error of the objective approximation. This is what lets us define a trust region within which we can trust the result.

$$\big|\, J(\pi') - \mathcal{L}_{\pi}(\pi') \,\big| \;\le\; C\, D_{KL}^{\max}(\pi, \pi')$$

In fact, by Lagrangian duality, our objective is mathematically equivalent to the following formulation with a trust region constraint.

$$\max_{\pi'} \; \mathcal{L}_{\pi}(\pi') \qquad \text{subject to} \quad D_{KL}(\pi, \pi') \le \delta$$

Recap: Objective function

We have now established the objective functions that we optimize in each MM iteration. The first form is referred to as the KL-penalized objective and the second as the KL-constrained objective.

$$\text{KL penalized:} \qquad \max_{\pi'} \; \mathcal{L}_{\pi}(\pi') - C\, D_{KL}^{\max}(\pi, \pi')$$
$$\text{KL constrained:} \qquad \max_{\pi'} \; \mathcal{L}_{\pi}(\pi') \quad \text{subject to} \quad D_{KL}(\pi, \pi') \le \delta$$

Guaranteed monotonic improvement

The power of TRPO, PPO, and the natural policy gradient builds on the concept of guaranteed monotonic improvement: theoretically, the policy update in each TRPO iteration creates a better policy. Let’s walk through a simple proof of this guarantee next. From here, we can prove that

$$J(\pi') \;\ge\; J(\pi)$$

The R.H.S. term below equals zero when π’ = π. Since π’ is chosen to maximize M, the R.H.S. is at least that value, and therefore the L.H.S. is always greater than or equal to 0.

$$J(\pi') - J(\pi) \;\ge\; M(\pi') - M(\pi)$$

i.e. the new policy is always at least as good as the old one. In fact, the new policy improves the real objective function by at least as much as it improves the lower-bound approximation:

$$J(\pi') \;\ge\; M(\pi') \quad\text{and}\quad J(\pi) \;=\; M(\pi)$$
$$\Rightarrow\quad J(\pi') - J(\pi) \;\ge\; M(\pi') - M(\pi) \;\ge\; 0$$

Intuition

Can we interpret the equation intuitively? We can approximate the expected advantage function locally around the current policy, but the accuracy decreases as the new policy and the current policy diverge from each other. However, we can establish an upper bound for the error. Therefore, we can guarantee a policy improvement as long as we optimize the local approximation within a trust region. Outside this region, the bet is off: even if a point has a better calculated value, its error range breaks the improvement guarantee. With such a guarantee inside the trust region, we can locate the optimal policy iteratively. So even though it takes a while to prove mathematically, the reasoning is pretty simple.

KL penalized vs. KL constrained

Mathematically, the KL-penalized objective and the KL-constrained objective are equivalent if we have unlimited computational resources. In practice, however, they are not.

$$C \;=\; \frac{4\epsilon\gamma}{(1-\gamma)^{2}}$$

C gets very large when γ (the discount factor) is close to one, and the corresponding gradient step size becomes too small. One solution is to turn both C and δ into tunable hyperparameters.
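To see how quickly C blows up, take the hypothetical values ε = 1 and γ = 0.99 (these numbers are for illustration only):

$$C \;=\; \frac{4 \cdot 1 \cdot 0.99}{(1 - 0.99)^{2}} \;=\; \frac{3.96}{0.0001} \;=\; 39{,}600$$

With a penalty coefficient this large, even a tiny KL-divergence dominates the objective and the optimizer barely changes the policy.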

$$\max_{\pi'} \; \mathcal{L}_{\pi}(\pi') - C\, D_{KL}(\pi, \pi') \qquad (C \text{ tunable})$$
$$\max_{\pi'} \; \mathcal{L}_{\pi}(\pi') \quad \text{subject to} \quad D_{KL}(\pi, \pi') \le \delta \qquad (\delta \text{ tunable})$$

In practice, δ is much easier to tune than C. δ imposes a hard constraint to control the bad-case scenarios in the policy space: it restrains policy changes that could turn destructive. Tuning C is much harder. Empirical results show that it cannot be a fixed value and needs to be adaptive. Therefore, the trust region constraint is more popular. As a footnote, this can be fixed if certain enhancements are made to the KL penalty method (as PPO does).

We have defined the optimization problems we want to solve. However, solving them is not easy. Next, we look at how the natural policy gradient method solves the problem analytically and how TRPO addresses its weakness.

Natural Policy Gradient

The natural policy gradient method solves the following constrained objective analytically.

$$\max_{\theta} \; \mathcal{L}_{\theta_{old}}(\theta) \qquad \text{subject to} \quad D_{KL}(\theta_{old} \,\|\, \theta) \le \delta$$

To solve it, we can use a Taylor series to expand both terms above up to second order. The second-order term of 𝓛 is much smaller than that of the KL-divergence term and will be ignored.

$$\mathcal{L}_{\theta_{old}}(\theta) \;\approx\; \mathcal{L}_{\theta_{old}}(\theta_{old}) + g^{T}(\theta - \theta_{old})$$
$$D_{KL}(\theta_{old} \,\|\, \theta) \;\approx\; D_{KL}(\theta_{old} \,\|\, \theta_{old}) + \nabla_{\theta} D_{KL}(\theta_{old} \,\|\, \theta)\big|_{\theta = \theta_{old}}^{T}(\theta - \theta_{old}) + \tfrac{1}{2}(\theta - \theta_{old})^{T} H\, (\theta - \theta_{old})$$

Taking out the terms that are zero (the KL-divergence and its gradient both vanish at θ = θ_old) and dropping the constant 𝓛(θ_old), it becomes:

$$\mathcal{L}_{\theta_{old}}(\theta) \;\approx\; g^{T}(\theta - \theta_{old}), \qquad D_{KL}(\theta_{old} \,\|\, \theta) \;\approx\; \tfrac{1}{2}(\theta - \theta_{old})^{T} H\, (\theta - \theta_{old})$$

where g is the policy gradient (the same one we learned in Policy Gradient) and H measures the sensitivity (curvature) of the policy relative to the model parameters θ. Our objective becomes:

$$\theta^{*} \;=\; \arg\max_{\theta} \; g^{T}(\theta - \theta_{old}) \qquad \text{subject to} \quad \tfrac{1}{2}(\theta - \theta_{old})^{T} H\, (\theta - \theta_{old}) \le \delta$$

This is a quadratic optimization problem and can be solved analytically:

$$\theta \;=\; \theta_{old} + \sqrt{\frac{2\delta}{g^{T} H^{-1} g}}\; H^{-1} g$$

The term below maps the changes we want in the policy space into the corresponding parameter space.

$$H^{-1} g$$

It provides a solution that works regardless of how we parameterize the policy model (proof). H has different values under different parametrizations, but it generates parameter changes (Δθ) that map to the same policy change. Therefore, the solution is model-invariant. Gradient descent, by contrast, approximates the space as locally flat, which corresponds to the Euclidean distance we are familiar with. If we replace H above with the identity matrix (i.e., use the Euclidean distance), the parameter updates are no longer model-invariant: the calculated parameter changes for different models correspond to different policy changes. This is the theoretical reason why the natural policy gradient is better than plain gradient ascent.
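As a minimal numeric sketch of this update (the gradient g, the matrix H, and δ below are made-up values for illustration, with θ_old taken as the zero vector):

```python
import numpy as np

g = np.array([0.5, -0.2])                 # policy gradient (hypothetical)
H = np.array([[2.0, 0.3],                 # Fisher Information Matrix (hypothetical)
              [0.3, 1.0]])
delta = 0.01                              # trust region size (KL limit)

natural_grad = np.linalg.solve(H, g)      # H^-1 g, without forming the inverse explicitly
step_size = np.sqrt(2 * delta / (g @ natural_grad))
theta_new = np.zeros_like(g) + step_size * natural_grad
print(theta_new)
```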

The natural policy gradient is a second-order optimization method that is more accurate and works regardless of how the model is parameterized (model-invariant).

H is a Hessian matrix in the generic form of:

$$H_{ij} \;=\; \frac{\partial^{2} f(\theta)}{\partial \theta_i \, \partial \theta_j}$$

This particular matrix measures the curvature (second-order derivative) of the log probability of the policy. There is a special term for it: the Fisher Information Matrix (FIM). Much of the literature uses F instead of H to represent the FIM. In this article, both H and F refer to the FIM.

F can also be computed with the expression below:

$$F \;=\; \mathbb{E}_{\pi_\theta}\!\Big[ \nabla_{\theta} \log \pi_{\theta}(a|s) \;\nabla_{\theta} \log \pi_{\theta}(a|s)^{T} \Big]$$

The following is the pseudocode. We sample trajectories from the current policy and use them to compute the advantage function. Then we compute the policy gradient and use the equation above to compute the FIM.

[Pseudocode: the natural policy gradient loop — sample trajectories with the current policy, estimate advantages, compute the policy gradient g and the FIM F, then apply the natural gradient update.]
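As a minimal sketch of the estimation steps in that loop (the per-sample gradients ∇θ log π(a|s) and the advantages are faked with random numbers here; a real implementation would compute them from the sampled trajectories):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_params = 1000, 4

# Stand-ins for grad_theta log pi_theta(a|s) and A(s, a) on the sampled (s, a) pairs.
grad_log_pi = rng.normal(size=(n_samples, n_params))
advantages  = rng.normal(size=n_samples)

g = (grad_log_pi * advantages[:, None]).mean(axis=0)                   # policy gradient estimate
F = (grad_log_pi[:, :, None] * grad_log_pi[:, None, :]).mean(axis=0)   # FIM: E[grad grad^T]
natural_grad = np.linalg.solve(F, g)                                   # F^-1 g for the update
```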

The caveat of natural policy gradient

Finding the inverse of H is expensive if the policy is parametrized by many parameters, in particular for deep networks. The inverse is also often numerically unstable (errors are easily magnified by data imprecision).


Truncated Natural Policy Gradient

Instead of computing the inverse of the FIM, we want to compute the following combined term directly:

$$x \;=\; F^{-1} g$$

where x can be obtained by solving the linear system:

$$F x \;=\; g$$

We convert this linear system into an optimization problem over a quadratic objective, whose minimum satisfies Fx = g:

$$\nabla_{x} \Big( \tfrac{1}{2}\, x^{T} F x - g^{T} x \Big) \;=\; F x - g \;=\; 0$$

In short, we can transform our problem into optimizing the quadratic objective below:

$$\min_{x} \; \tfrac{1}{2}\, x^{T} F x \;-\; g^{T} x$$

To optimize this, we apply the conjugate gradient method. The concept is similar to gradient ascent, but it converges in far fewer iterations.

In gradient ascent, we always follow the steepest gradient. In our example, we want to ascend from the yellow spot to the red one. Say the first move goes to the right according to the gradient contour. To follow the steepest gradient again, the second move may go up and slightly to the left. We may ask one question here: why does the second move go left? It sounds like undoing some of the progress made by the first move, even though every move gets closer to the red spot.

[Figure: contour plot of the objective — gradient ascent zigzags from the yellow starting point toward the red optimum.]

If the objective function is quadratic, we can avoid this inefficiency using the conjugate gradient method. If the model has N parameters, we can find the optimal point in at most N ascent steps. In the first move, we follow the steepest gradient direction d and settle at the optimal point in that direction (the peak of the objective function along d). Then every subsequent search direction dⱼ must be A-orthogonal (conjugate) to all previous directions dᵢ. Mathematically, this means

$$d_i^{T} A\, d_j \;=\; 0 \qquad \text{for } i \ne j$$

We can imagine those vectors as perpendicular to each other after some transformation with A.

[Figure: the arrows shown are conjugate (A-orthogonal) to each other.]

Conjugate gradient is about finding, at every step, a search direction that is A-orthogonal to all previous directions, so we are never undoing part of the progress previously made. Then we move along this new direction to the optimal point on the search line.

[Figure: the conjugate gradient search path, moving to the optimum along successive conjugate directions.]

These vectors (directions) are independent of each other and span an N-dimensional space, so we can solve the problem in at most N steps. The explanation sounds complicated, but the algorithm actually has much lower complexity than computing the inverse of F. Now, we can compute

$$x \;=\; F^{-1} g$$

in N iterations. Truncated Natural Policy Gradient (TNPG) uses the conjugate gradient method to replace the explicit computation of the inverse of the FIM, as in the algorithm and the sketch below.

[Algorithm: Truncated Natural Policy Gradient — the same loop as natural policy gradient, except the direction F⁻¹g is computed with conjugate gradient instead of an explicit matrix inverse.]
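For concreteness, here is a minimal NumPy sketch of a conjugate gradient solver for Fx = g (a standard textbook version, not the exact code of any TRPO implementation). In real TRPO code the product F·v is usually computed with Fisher-vector products so that F is never formed explicitly; here a small explicit matrix stands in for it:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, where fvp(v) returns the product F @ v."""
    x = np.zeros_like(g)          # start from x = 0
    r = g.copy()                  # residual r = g - F @ x
    p = r.copy()                  # first search direction = steepest direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp)  # optimal step length along p
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p   # next direction, conjugate to the previous ones
        r_dot = new_r_dot
    return x

# Toy usage with an explicit matrix standing in for the FIM:
F = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ x, g))      # True (up to tolerance)
```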

Trust region policy optimization (TRPO)

Finally, we put everything together for TRPO. TRPO applies the conjugate gradient method to the natural policy gradient, but that alone is not enough. If something seems too good to be true, it may not be: the trust region for the natural policy gradient is very small, and we relax it to a larger, tunable value. In addition, the quadratic approximation we made also reduces accuracy. These factors may reintroduce problems into the policy updates, and in some training iterations performance may degrade. One mitigation is to verify the new policy before committing the change. To do that, we verify that

  • KL-divergence for the new policy θ is ≤ δ and
  • 𝓛(θ) ≥ 0.

If the verification fails, we decay the natural policy gradient step by a factor of α (0 < α < 1) until the new parameters meet the requirements above. The following describes this backtracking line search:

$$\theta \;=\; \theta_{old} + \alpha^{j} \sqrt{\frac{2\delta}{x^{T} H x}}\; x, \qquad x \approx H^{-1} g$$

where j is the smallest non-negative integer for which the new θ passes both checks above.
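A rough sketch of that backtracking loop; kl_divergence and surrogate_improvement are hypothetical stand-ins for functions that evaluate the two checks on the sampled batch:

```python
def backtracking_line_search(theta_old, full_step, kl_divergence, surrogate_improvement,
                             delta, alpha=0.8, max_backtracks=10):
    """Shrink the natural gradient step until D_KL <= delta and L(theta) >= 0."""
    for j in range(max_backtracks):
        theta_new = theta_old + (alpha ** j) * full_step
        if kl_divergence(theta_new) <= delta and surrogate_improvement(theta_new) >= 0:
            return theta_new          # accept the (possibly shrunken) step
    return theta_old                  # no acceptable step found: keep the old policy
```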

TRPO Algorithm

TRPO combines both Truncated Natural Policy Gradient (using the conjugate gradient) and the backtracking line search:

[Algorithm: TRPO — sample trajectories, estimate advantages, compute the policy gradient g, solve Fx = g with conjugate gradient, scale the step to the trust region δ, and accept the update only if the backtracking line search checks pass.]
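Putting the pieces together, one TRPO update could be outlined roughly as below. It reuses the conjugate_gradient and backtracking_line_search sketches above, and every other helper (sample_trajectories, estimate_advantages, policy_gradient, fisher_vector_product, kl_divergence, surrogate_improvement) is a hypothetical placeholder rather than an official API:

```python
import numpy as np

def trpo_update(theta, sample_trajectories, estimate_advantages, policy_gradient,
                fisher_vector_product, kl_divergence, surrogate_improvement, delta=0.01):
    batch = sample_trajectories(theta)                    # rollouts under the current policy
    adv = estimate_advantages(batch)                      # advantage estimates A(s, a)
    g = policy_gradient(theta, batch, adv)                # gradient of the surrogate objective
    x = conjugate_gradient(lambda v: fisher_vector_product(theta, batch, v), g)  # x ~ F^-1 g
    full_step = np.sqrt(2 * delta / (x @ g)) * x          # since x ~ F^-1 g, x^T F x ~ x^T g
    return backtracking_line_search(
        theta, full_step,
        kl_divergence=lambda th: kl_divergence(theta, th, batch),
        surrogate_improvement=lambda th: surrogate_improvement(theta, th, batch, adv),
        delta=delta)
```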

TRPO shortcomings

TRPO minimizes the quadratic objective below to approximate F⁻¹g without explicitly inverting F.

$$\min_{x} \; \tfrac{1}{2}\, x^{T} F x \;-\; g^{T} x$$

But, for each model parameter update, we still need to compute F. This hurts scalability:

  • Computing F every time for the current policy model is expensive, and
  • It requires a large batch of rollouts to approximate F accurately.

In general, TRPO is less sample-efficient than other policy gradient methods trained with first-order optimizers like Adam. Because of this scalability issue, TRPO is not practical for large deep networks. PPO and ACKTR were introduced to address these problems.

It has been a long journey, but congratulations, you made it! I hope you now have a far deeper understanding of TRPO.

Credit and references

UC Berkeley RL course

TRPO paper

UC Berkeley RL Bootcamp

UCL RL course
