RL — Appendix: Proofs for the articles on TRPO & PPO

Difference of discounted rewards

The difference in the discounted rewards for two policies is:

Image for post
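As a sanity check, the identity can be verified numerically on a small tabular MDP. The sketch below assumes the usual form of the result: J(π’) − J(π) equals the expected discounted sum of the advantages of π, collected along trajectories sampled from π’. All names here (`random_mdp`, `random_policy`, `evaluate`, `reward_gap_via_advantage`) are hypothetical helpers written for this illustration, and the reward is assumed to depend on the state and action only.

```python
import numpy as np

def random_mdp(n_states, n_actions, rng):
    """Random transition tensor P[s, a, s'], rewards R[s, a], start distribution mu."""
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((n_states, n_actions))
    mu = rng.random(n_states)
    mu /= mu.sum()
    return P, R, mu

def random_policy(n_states, n_actions, rng):
    """Random stochastic policy pi[s, a]."""
    pi = rng.random((n_states, n_actions))
    return pi / pi.sum(axis=1, keepdims=True)

def evaluate(P, R, mu, pi, gamma):
    """Exact J, advantage table A[s, a], and unnormalized discounted visitation rho[s]."""
    n = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)                  # expected one-step reward under pi
    M = np.eye(n) - gamma * P_pi
    V = np.linalg.solve(M, r_pi)                 # V = (I - gamma P_pi)^{-1} r_pi
    Q = R + gamma * (P @ V)                      # Q[s, a]
    A = Q - V[:, None]
    rho = np.linalg.solve(M.T, mu)               # rho[s] = sum_t gamma^t Pr(s_t = s)
    return float(mu @ V), A, rho

def reward_gap_via_advantage(P, R, mu, pi_old, pi_new, gamma):
    """Right-hand side: advantages of pi_old averaged over the visitation of pi_new."""
    _, A_old, _ = evaluate(P, R, mu, pi_old, gamma)
    _, _, rho_new = evaluate(P, R, mu, pi_new, gamma)
    return float(rho_new @ (pi_new * A_old).sum(axis=1))

rng = np.random.default_rng(0)
P, R, mu = random_mdp(5, 3, rng)
pi_old, pi_new = random_policy(5, 3, rng), random_policy(5, 3, rng)
J_old, _, _ = evaluate(P, R, mu, pi_old, 0.9)
J_new, _, _ = evaluate(P, R, mu, pi_new, 0.9)
print(J_new - J_old, reward_gap_via_advantage(P, R, mu, pi_old, pi_new, 0.9))
```

The two printed numbers should agree up to floating-point error. Note that `rho` is the unnormalized discounted visitation; multiplying it by (1 − γ) gives a proper distribution over states.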

Proof:

Image for post
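For reference, a sketch of the standard telescoping argument, written in assumed notation: J is the expected discounted reward, V^π and A^π are the value and advantage functions of π, and r(s, a) is the reward. The first step uses the fact that, in expectation over the next state, A^π(s, a) = r(s, a) + γV^π(s’) − V^π(s).

```latex
\begin{aligned}
\mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t}A^{\pi}(s_t,a_t)\Big]
&= \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(r(s_t,a_t)+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t)\big)\Big] \\
&= \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(s_t,a_t)\Big]
 + \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\big(\gamma^{t+1}V^{\pi}(s_{t+1})-\gamma^{t}V^{\pi}(s_t)\big)\Big] \\
&= J(\pi') - \mathbb{E}_{s_0}\big[V^{\pi}(s_0)\big] \\
&= J(\pi') - J(\pi)
\end{aligned}
```

The second sum telescopes to −V^π(s_0), and the start state s_0 does not depend on which policy is followed, which is why its expectation is J(π).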

Natural Policy Gradient is covariant

Image for post
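A short sketch of why the natural gradient step does not depend on the parameterization, to first order. The symbols are assumptions for this illustration: θ = h(w) is a smooth reparameterization with Jacobian J_h (assumed square and invertible), g is the ordinary gradient, and F is the Fisher information matrix.

```latex
\begin{aligned}
g_w &= J_h^{\top} g_\theta, \qquad F_w = J_h^{\top} F_\theta J_h \\
\Delta w &= F_w^{-1} g_w
  = J_h^{-1} F_\theta^{-1} J_h^{-\top} J_h^{\top} g_\theta
  = J_h^{-1} F_\theta^{-1} g_\theta \\
J_h\,\Delta w &= F_\theta^{-1} g_\theta = \Delta\theta
\end{aligned}
```

A natural gradient step taken in w therefore induces, to first order, the same change in θ as a natural gradient step taken directly in θ. The ordinary gradient does not have this property because J_h does not cancel.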

Approximate the difference of discounted rewards with 𝓛

Image for post

Proof (assuming both policies are similar):
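A sketch of the usual argument, in assumed notation: d^π is the normalized discounted state distribution under π, and A^π is the advantage function of π. The only approximation is replacing d^{π’} with d^π, which is reasonable when the two policies visit similar states.

```latex
\begin{aligned}
J(\pi') - J(\pi)
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi'},\,a\sim\pi'}\big[A^{\pi}(s,a)\big] \\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi'},\,a\sim\pi}\Big[\frac{\pi'(a\mid s)}{\pi(a\mid s)}\,A^{\pi}(s,a)\Big] \\
&\approx \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi},\,a\sim\pi}\Big[\frac{\pi'(a\mid s)}{\pi(a\mid s)}\,A^{\pi}(s,a)\Big]
 = \mathcal{L}_{\pi}(\pi')
\end{aligned}
```

The second line is importance sampling over actions and is exact. The third line is where the similarity assumption enters.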

𝓛 matches K to the first order

i.e.

  • K(π) = 𝓛(π), and
  • K’(π) = 𝓛’(π)

Proof:

Image for post
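A sketch of the two checks for a parameterized policy π_θ with current parameters θ_k. It assumes that K(θ) denotes the true difference J(π_θ) − J(π_{θ_k}) and that 𝓛 is the importance-sampled surrogate; both are assumptions about notation, not statements taken from the figures.

```latex
\begin{aligned}
\mathcal{L}_{\theta_k}(\theta)
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi_k},\,a\sim\pi_k}\Big[\frac{\pi_\theta(a\mid s)}{\pi_{\theta_k}(a\mid s)}\,A^{\pi_k}(s,a)\Big] \\
\mathcal{L}_{\theta_k}(\theta_k)
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi_k}}\Big[\mathbb{E}_{a\sim\pi_k}\big[A^{\pi_k}(s,a)\big]\Big] = 0 = K(\theta_k) \\
\nabla_\theta \mathcal{L}_{\theta_k}(\theta)\Big|_{\theta_k}
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi_k},\,a\sim\pi_k}\Big[\nabla_\theta \log \pi_\theta(a\mid s)\Big|_{\theta_k} A^{\pi_k}(s,a)\Big]
 = \nabla_\theta J(\pi_\theta)\Big|_{\theta_k} = \nabla_\theta K(\theta)\Big|_{\theta_k}
\end{aligned}
```

The value matches because the expected advantage of a policy under its own action distribution is zero. The gradient matches by the policy gradient theorem.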

Approximate the expected rewards as a quadratic equation

For the objective

Image for post

We can use a Taylor series to expand both terms above up to the second order. The second-order term of 𝓛 is much smaller than the KL-divergence term and is ignored.

Image for post

After dropping the terms that evaluate to zero:

Image for post

where g is the policy gradient and H measures the sensitivity (curvature) of the policy relative to the model parameter θ.
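Written out, the expansion around the current parameters θ_k looks roughly as follows. The notation (θ_k, and the averaged KL-divergence written as a function of θ) is assumed for this sketch.

```latex
\begin{aligned}
\mathcal{L}_{\theta_k}(\theta)
&\approx \underbrace{\mathcal{L}_{\theta_k}(\theta_k)}_{=\,0} + g^{\top}(\theta-\theta_k),
&& g = \nabla_\theta \mathcal{L}_{\theta_k}(\theta)\Big|_{\theta_k} \\
\bar{D}_{KL}(\theta\,\|\,\theta_k)
&\approx \underbrace{\bar{D}_{KL}(\theta_k\,\|\,\theta_k)}_{=\,0}
 + \underbrace{\nabla_\theta \bar{D}_{KL}\Big|_{\theta_k}^{\top}}_{=\,0}(\theta-\theta_k)
 + \tfrac{1}{2}(\theta-\theta_k)^{\top} H\,(\theta-\theta_k),
&& H = \nabla_\theta^{2}\,\bar{D}_{KL}(\theta\,\|\,\theta_k)\Big|_{\theta_k}
\end{aligned}
```

The KL-divergence is zero and at a minimum when the two policies coincide, so its value and gradient vanish at θ_k and the quadratic term is the leading one.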

Our objective can therefore be approximated as:

Image for post

or

Image for post
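To make the quadratic approximation concrete, here is a minimal sketch that solves the constrained form in closed form. It assumes the problem reads: maximize gᵀΔθ subject to ½ ΔθᵀHΔθ ≤ δ, with H symmetric positive definite. The function name `trust_region_step` and the radius `delta` are hypothetical names chosen for this illustration.

```python
import numpy as np

def trust_region_step(g, H, delta):
    """Maximize g^T x subject to 0.5 * x^T H x <= delta.

    Assumes H is symmetric positive definite. The maximizer lies along
    H^{-1} g, scaled so that the constraint is met with equality.
    """
    direction = np.linalg.solve(H, g)               # H^{-1} g, without forming the inverse
    scale = np.sqrt(2.0 * delta / (g @ direction))  # g^T H^{-1} g > 0 for SPD H and g != 0
    return scale * direction

# Small usage example with a random positive definite H.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
H = A @ A.T + np.eye(4)
g = rng.normal(size=4)
step = trust_region_step(g, H, delta=0.01)
print(0.5 * step @ H @ step)  # lands on the trust-region boundary, i.e. delta
```

Solving the linear system is a design choice for small problems. For large networks H is usually too big to form explicitly, and an iterative solver such as conjugate gradient is the common substitute.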

𝓛 & M function

We want to prove

Image for post

During the proof, we will also show that M approximates the following terms locally (a requirement for the MM method).

Image for post

1. The difference in the discounted rewards between two different policies can be computed as (proof):

Image for post

2. 𝓛 can be approximated as (proof)

Image for post

3. When π’ = π, the L.H.S. above is zero and we can show (proof)

Image for post

The claim in (3) is particularly important for us. Since the KL-divergence is zero when both policies are the same, the R.H.S. below approximates our objective function locally at π’ = π. This is one requirement of the MM algorithm.

Proof:

Image for post

becomes

Image for post

So M approximates our objective locally.
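For context, a sketch of how these properties give the monotonic-improvement argument of the MM method. The form of M used here (the surrogate 𝓛 minus a constant C times the square root of the averaged KL-divergence) is an assumption about the notation, as is the iteration index k.

```latex
\begin{aligned}
M_k(\pi') &= \mathcal{L}_{\pi_k}(\pi') - C\sqrt{\mathbb{E}_{s\sim d^{\pi_k}}\big[D_{KL}(\pi'\,\|\,\pi_k)[s]\big]} \\
M_k(\pi_k) &= 0 = J(\pi_k) - J(\pi_k) \\
M_k(\pi') &\le J(\pi') - J(\pi_k) \quad \text{for all } \pi' \\
\pi_{k+1} = \arg\max_{\pi'} M_k(\pi')
&\;\Longrightarrow\; J(\pi_{k+1}) - J(\pi_k) \;\ge\; M_k(\pi_{k+1}) \;\ge\; M_k(\pi_k) = 0
\end{aligned}
```

Because M is a lower bound that touches the true objective at the current policy, maximizing M can never make the true expected reward worse.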
