# RL — Appendix: Proofs for the TRPO & PPO article

**Difference of discounted rewards**

The difference in the discounted rewards for two policies is:

Proof:
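The equation and its proof did not survive extraction. The statement is presumably the standard result of Kakade & Langford (Lemma 1 in the TRPO paper); a hedged reconstruction, writing *η(π)* for the discounted reward of policy *π*:

```latex
\eta(\pi') - \eta(\pi)
  = \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right]

% Proof sketch: expand the advantage with the Bellman identity and telescope.
\mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right]
  = \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^{t}
      \left(r(s_t) + \gamma V_{\pi}(s_{t+1}) - V_{\pi}(s_t)\right)\right]
  = \mathbb{E}_{\tau \sim \pi'}\left[-V_{\pi}(s_0) + \sum_{t=0}^{\infty} \gamma^{t} r(s_t)\right]
  = \eta(\pi') - \eta(\pi)
```

The last step uses the fact that both policies share the same initial-state distribution, so 𝔼[V_π(s₀)] = η(π).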

**Natural Policy Gradient is covariant**
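The body of this section did not survive extraction. The claim is that the natural gradient *F⁻¹g* is covariant: the update it induces on the policy distribution does not depend, to first order, on how the policy is parameterized, because the Fisher information matrix transforms as a metric (F′ = Jᵀ F J under a change of variables with Jacobian J). A toy numeric check of this invariance (my own construction, not from the article), using a Bernoulli model in two parameterizations:

```python
# Toy check (my own construction, not from the article): the natural
# gradient produces the same first-order change in the distribution
# under two different parameterizations of a Bernoulli(p) model.
# Objective: J = E[X] = p, so dJ/dp = 1.

p = 0.3
dJ_dp = 1.0

# Parameterization 1: the probability p itself.
# Fisher information of Bernoulli(p) is 1 / (p (1 - p)).
F_p = 1.0 / (p * (1.0 - p))
step_p = (1.0 / F_p) * dJ_dp          # natural-gradient step in p-space

# Parameterization 2: the logit theta, with p = sigmoid(theta).
dp_dtheta = p * (1.0 - p)             # Jacobian of the reparameterization
dJ_dtheta = dJ_dp * dp_dtheta         # chain rule
F_theta = dp_dtheta**2 * F_p          # Fisher transforms as a metric: J^T F J
step_theta = (1.0 / F_theta) * dJ_dtheta

# Map the theta-space step back into p-space (first order): identical,
# about 0.21 in both parameterizations.
step_theta_in_p = dp_dtheta * step_theta
assert abs(step_p - step_theta_in_p) < 1e-12
```

A vanilla gradient would not pass this check: dJ/dθ = 0.21 differs from dJ/dp = 1 even after mapping between the two spaces, which is exactly the parameterization sensitivity the natural gradient removes.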

**Approximate the difference of discounted rewards with 𝓛**

Proof (assuming both policies are similar):
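The missing equations here are presumably the exact identity rewritten with the discounted state-visitation frequency ρ, and the surrogate 𝓛 obtained by swapping ρ_π′ for ρ_π (equations 2–3 in the TRPO paper); a reconstruction:

```latex
\eta(\pi') = \eta(\pi) + \sum_{s} \rho_{\pi'}(s) \sum_{a} \pi'(a \mid s)\, A_{\pi}(s, a)

% rho_{pi'} requires rolling out pi'; replace it with the known rho_{pi}:
\mathcal{L}_{\pi}(\pi') = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \pi'(a \mid s)\, A_{\pi}(s, a)
```

The only change is ρ_π′ → ρ_π, which is why the approximation is only trusted when the two policies are close.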

**𝓛 matches K to the first order**

i.e.

*K(π) = 𝓛(π)* and *K’(π) = 𝓛’(π)*

Proof:
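The missing proof presumably follows the TRPO paper's matching argument, with *K* denoting the discounted reward η above; a sketch:

```latex
% Zeroth order: at pi' = pi the expected advantage vanishes,
\sum_{a} \pi(a \mid s)\, A_{\pi}(s, a)
  = \mathbb{E}_{a \sim \pi}\left[Q_{\pi}(s, a) - V_{\pi}(s)\right] = 0
\quad\Rightarrow\quad
\mathcal{L}_{\pi}(\pi) = \eta(\pi) = K(\pi)

% First order: differentiate w.r.t. the parameters theta' of pi'
% and evaluate at theta' = theta; both sides give the policy gradient:
\nabla_{\theta'} \mathcal{L}_{\pi_{\theta}}(\pi_{\theta'})\big|_{\theta'=\theta}
  = \nabla_{\theta'} \eta(\pi_{\theta'})\big|_{\theta'=\theta}
```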

**Approximate the expected rewards as a quadratic function**

For the objective

We can use a Taylor series to expand both terms above up to the second order. The second-order term of 𝓛 is much smaller than the KL-divergence term and will be ignored.

After taking out the zero values:
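The dropped zero values are 𝓛 evaluated at θ_k (a constant that does not affect the argmax), the KL-divergence at θ_k (zero), and the first-order KL term (its gradient vanishes at θ_k). What remains is presumably:

```latex
\mathcal{L}_{\theta_k}(\theta) \approx g^{T}(\theta - \theta_k),
\qquad
D_{KL}(\theta_k \,\|\, \theta) \approx \tfrac{1}{2}(\theta - \theta_k)^{T} H\, (\theta - \theta_k)

% with
g = \nabla_{\theta} \mathcal{L}_{\theta_k}(\theta)\big|_{\theta = \theta_k},
\qquad
H = \nabla_{\theta}^{2} D_{KL}(\theta_k \,\|\, \theta)\big|_{\theta = \theta_k}
```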

where **g** is the policy gradient and **H** measures the sensitivity (curvature) of the policy relative to the model parameter *θ*.

Our objective can therefore be approximated as:

or
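The two missing forms are presumably the constrained quadratic program and its closed-form solution (the natural-gradient update); a reconstruction:

```latex
\theta_{k+1} = \arg\max_{\theta}\; g^{T}(\theta - \theta_k)
\quad \text{subject to} \quad
\tfrac{1}{2}(\theta - \theta_k)^{T} H\, (\theta - \theta_k) \le \delta

% Solving with a Lagrange multiplier gives
\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{g^{T} H^{-1} g}}\; H^{-1} g
```

The square-root factor scales the natural-gradient direction H⁻¹g so that the quadratic KL constraint holds with equality.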

**𝓛 & M function**

We want to prove:

During the proof, we will also show that **M** approximates the following terms locally (a requirement for the MM method).

1. The difference in the discounted rewards between two different policies can be computed as (proof):

2. 𝓛 can be approximated as (proof):

3. When π’ = π, the L.H.S. above is zero and we can show (proof):

The claim in (3) is particularly important for us. Since the KL-divergence is zero when both policies are the same, the R.H.S. below approximates our objective function locally at π’ = π. This is one requirement for the MM algorithm.
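The bound in question is presumably Theorem 1 of the TRPO paper; a reconstruction of *M* and the local match at π’ = π:

```latex
M(\pi') = \mathcal{L}_{\pi}(\pi') - C \cdot D_{KL}^{\max}(\pi, \pi'),
\qquad
C = \frac{4 \epsilon \gamma}{(1 - \gamma)^{2}},\;\;
\epsilon = \max_{s, a} \left|A_{\pi}(s, a)\right|

% Lower bound:
\eta(\pi') \ge M(\pi')

% At pi' = pi the KL term is zero and L_pi(pi) = eta(pi), so the bound is tight:
M(\pi) = \mathcal{L}_{\pi}(\pi) = \eta(\pi)
```

Maximizing M therefore guarantees monotonic improvement of η, which is exactly the MM setup.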

Proof:

turns to

So **M** approximates our objective locally.