RL — Appendix: Proof for the article in TRPO & PPO

Jonathan Hui
3 min read · Aug 2, 2018

Difference of discounted rewards

The difference in the discounted rewards for two policies is:
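The original equation image is missing here; below is the standard identity from the TRPO paper that the statement refers to, written in LaTeX with the usual notation (η(π) is the expected discounted return of π, A_π the advantage function, and τ a trajectory sampled with π'):

\eta(\pi') = \eta(\pi) + \mathbb{E}_{\tau \sim \pi'}\!\left[\sum_{t=0}^{\infty} \gamma^t A_{\pi}(s_t, a_t)\right]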

Proof:
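The proof image is also missing; a sketch of the standard telescoping argument (Kakade & Langford, 2002), using A_\pi(s_t, a_t) = \mathbb{E}\!\left[r(s_t, a_t) + \gamma V_\pi(s_{t+1}) - V_\pi(s_t)\right]:

\mathbb{E}_{\tau \sim \pi'}\!\left[\sum_t \gamma^t A_\pi(s_t, a_t)\right]
= \mathbb{E}_{\tau \sim \pi'}\!\left[\sum_t \gamma^t \big(r(s_t, a_t) + \gamma V_\pi(s_{t+1}) - V_\pi(s_t)\big)\right]
= \mathbb{E}_{\tau \sim \pi'}\!\left[\sum_t \gamma^t r(s_t, a_t)\right] - \mathbb{E}_{s_0}\!\left[V_\pi(s_0)\right]
= \eta(\pi') - \eta(\pi)

The value terms telescope, leaving only -V_\pi(s_0), and \mathbb{E}_{s_0}[V_\pi(s_0)] = \eta(\pi).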


Natural Policy Gradient is covariant

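The proof image for this section is missing; a sketch of the standard covariance argument, assuming a smooth, invertible reparameterization \theta = f(\phi) with Jacobian J = \partial\theta / \partial\phi, and writing F for the Fisher information matrix and K for the expected-reward objective (the notation used later in this appendix):

\nabla_\phi K = J^\top \nabla_\theta K, \qquad F_\phi = J^\top F_\theta J

\Delta\phi = F_\phi^{-1} \nabla_\phi K = (J^\top F_\theta J)^{-1} J^\top \nabla_\theta K = J^{-1} F_\theta^{-1} \nabla_\theta K

\Delta\theta \approx J\,\Delta\phi = F_\theta^{-1} \nabla_\theta K

To first order, the induced change in θ (and therefore in the policy itself) does not depend on how the policy is parameterized, which is what covariant means here.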

Approximate the difference of discounted rewards with 𝓛
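The missing display most likely defines the surrogate 𝓛 (written L_\pi below) as in the TRPO paper, where \rho_\pi(s) = \sum_t \gamma^t P(s_t = s;\, \pi) is the discounted state visitation frequency:

L_\pi(\pi') = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \pi'(a|s)\, A_\pi(s, a), \qquad \eta(\pi') \approx L_\pi(\pi')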

Proof (assuming both policies are similar):
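A sketch of the missing proof: rewrite the identity above as a sum over states, then replace the visitation frequency of π' with that of π, which is justified when the two policies are close:

\eta(\pi') = \eta(\pi) + \sum_s \rho_{\pi'}(s) \sum_a \pi'(a|s)\, A_\pi(s, a)
\;\approx\; \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \pi'(a|s)\, A_\pi(s, a) = L_\pi(\pi')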

𝓛 matches K to the first order

i.e.

  • K(π) = 𝓛(π), and
  • K'(π) = 𝓛'(π)

Proof:
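A sketch of the missing proof, parameterizing the policy as \pi_\theta and writing K(\theta) for the expected discounted return \eta(\pi_\theta) (an assumption about the article's notation). At \theta = \theta_{old}:

L_{\pi_{\theta_{old}}}(\pi_{\theta_{old}}) = \eta(\pi_{\theta_{old}}) + \sum_s \rho_{\pi_{\theta_{old}}}(s) \sum_a \pi_{\theta_{old}}(a|s)\, A_{\pi_{\theta_{old}}}(s,a) = \eta(\pi_{\theta_{old}}) = K(\theta_{old})

because the expected advantage under the policy's own action distribution is zero, \sum_a \pi(a|s)\, A_\pi(s,a) = 0. For the gradients:

\nabla_\theta L_{\pi_{\theta_{old}}}(\pi_\theta)\big|_{\theta_{old}} = \sum_s \rho_{\pi_{\theta_{old}}}(s) \sum_a \nabla_\theta \pi_\theta(a|s)\big|_{\theta_{old}} A_{\pi_{\theta_{old}}}(s,a) = \nabla_\theta K(\theta)\big|_{\theta_{old}}

where the last equality is the policy gradient theorem (the V_\pi(s) part of the advantage contributes nothing because \sum_a \nabla_\theta \pi_\theta(a|s) = 0).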

Approximate the expected rewards as a quadratic equation

For the objective
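The missing display most likely states the trust-region objective; a reconstruction in the constrained form used in the TRPO paper (the penalized form L - C \cdot D_{KL} leads to the same expansion), with \delta as the trust-region size:

\max_\theta \; L_{\theta_{old}}(\theta) \quad \text{subject to} \quad \overline{D}_{KL}(\theta_{old}, \theta) \le \delta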

We can use a Taylor series to expand both terms above up to the second order. The second-order term of 𝓛 is much smaller than the KL-divergence term and will be ignored.
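A reconstruction of the missing expansion around \theta_{old}, writing \Delta\theta = \theta - \theta_{old} (the second-order term of 𝓛 is already dropped, as noted above):

L_{\theta_{old}}(\theta) \approx L_{\theta_{old}}(\theta_{old}) + g^\top \Delta\theta

\overline{D}_{KL}(\theta_{old}, \theta) \approx \overline{D}_{KL}(\theta_{old}, \theta_{old}) + \nabla_\theta \overline{D}_{KL}\big|_{\theta_{old}}^{\top} \Delta\theta + \tfrac{1}{2}\, \Delta\theta^\top H\, \Delta\theta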

After taking out the zero values:
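The zero values are the zeroth- and first-order terms of the KL-divergence (it is minimized at \theta_{old}, so both vanish), and the constant L_{\theta_{old}}(\theta_{old}) does not affect the optimization; the expansion reduces to (a sketch under the same notation):

L_{\theta_{old}}(\theta) \approx g^\top \Delta\theta, \qquad \overline{D}_{KL}(\theta_{old}, \theta) \approx \tfrac{1}{2}\, \Delta\theta^\top H\, \Delta\theta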

where g is the policy gradient and H measures the sensitivity (curvature) of the policy relative to the model parameter θ.

Our objective can therefore be approximated as:
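(The missing display is reconstructed below as the hard-constraint form; \delta is the assumed trust-region size.)

\max_\theta \; g^\top \Delta\theta \quad \text{subject to} \quad \tfrac{1}{2}\, \Delta\theta^\top H\, \Delta\theta \le \delta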

or
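(Or equivalently, reconstructed as the KL-penalized form with coefficient C.)

\max_\theta \; g^\top \Delta\theta - \tfrac{C}{2}\, \Delta\theta^\top H\, \Delta\theta

Either form is maximized along the natural gradient direction H^{-1} g, which is how this quadratic approximation connects back to the natural policy gradient.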

𝓛 & M function

We want to prove:
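The missing display is presumably Theorem 1 of the TRPO paper, restated here; the definition of M below is an assumption consistent with the rest of this section:

\eta(\pi') \;\ge\; L_\pi(\pi') - C\, D_{KL}^{\max}(\pi, \pi') = M(\pi'), \qquad C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}, \quad \epsilon = \max_{s,a} |A_\pi(s,a)|

That is, M is a lower bound on the expected discounted rewards of π'.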

During the proof, we will also show that M approximates the following terms locally (a requirement for the MM method); the three relations are restated right after the list.

  1. The difference in the discounted rewards between two different policies can be computed as (proof):

  2. 𝓛 can be approximated as (proof)

  3. When π' = π, the L.H.S. above is zero and we can show (proof)
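(The three relations, restated compactly; these are reconstructions from the derivations above, since the original figures are missing.)

1.\; \eta(\pi') = \eta(\pi) + \mathbb{E}_{\tau \sim \pi'}\!\left[\sum_t \gamma^t A_\pi(s_t, a_t)\right]

2.\; \eta(\pi') \approx L_\pi(\pi')

3.\; \text{at } \pi' = \pi: \quad L_\pi(\pi) = \eta(\pi) \;\text{ and }\; D_{KL}^{\max}(\pi, \pi) = 0, \;\text{ so }\; M(\pi) = \eta(\pi)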

The claim in (3) is particularly important for us. Since the KL-divergence is zero when both policies are the same, the R.H.S. below approximates our objective function locally at π' = π. This is one requirement for the MM algorithm.

Proof: at π' = π, the lower bound turns into an equality, as sketched below.
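(A sketch, reusing the fact that L_\pi(\pi) = \eta(\pi) and that the KL-divergence of a policy with itself is zero.)

M(\pi) = L_\pi(\pi) - C\, D_{KL}^{\max}(\pi, \pi) = \eta(\pi) - C \cdot 0 = \eta(\pi)

Together with \eta(\pi') \ge M(\pi') for every π', M minorizes the expected rewards and touches them at π' = π.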

So M approximates our objective locally.
