If you substitute ∇log π(τ)

Image for post

into the expectation

Image for post

it becomes

Image for post
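The equations in this passage were images. As a sketch only, here is the standard form of this substitution in the usual trajectory notation. The symbols (horizon T, trajectory reward r(τ) as a sum of per-step rewards r(s_t, a_t)) are my assumptions and may not match the original figures exactly. The first line holds because the initial-state distribution and the dynamics do not depend on θ.

```latex
\begin{align}
\nabla_\theta \log \pi_\theta(\tau)
  &= \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
\nabla_\theta J(\theta)
  &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}
     \big[ \nabla_\theta \log \pi_\theta(\tau) \, r(\tau) \big] \\
  &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}
     \Big[ \Big( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big)
           \Big( \sum_{t=1}^{T} r(s_t, a_t) \Big) \Big]
\end{align}
```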

That said, quite a lot of people ask why this equation differs from the one in Richard Sutton's book (which also appears in David Silver's slides).

Image for post

This is because Sutton and Silver start by formulating the problem as follows:

Image for post

which leads to a per-timestep update.
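For reference, this is how the episodic objective and the REINFORCE update are usually written in the notation of Sutton and Barto's book. Treat it as a sketch of the per-timestep form, not a reproduction of the original figure: here s_0 is the start state, α is the step size, γ is the discount factor, and G_t is the return from step t.

```latex
\begin{align}
J(\theta) &= v_{\pi_\theta}(s_0) \\
G_t &= \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k \\
\theta_{t+1} &= \theta_t + \alpha \, \gamma^{t} \, G_t \,
               \nabla_\theta \ln \pi(A_t \mid S_t, \theta_t)
\end{align}
```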

Professor Levine formulates the problem as follows:

Image for post

which leads to a per-trajectory update. I went back to the original Williams 1992 paper to find out what the original intention was. I did not get a clear answer, and I don't want to speak for what Williams had in mind. But that is not so important: we are not historians, and what we care about is what to use.
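To make the difference concrete, here is a minimal sketch of both estimators on a single sampled trajectory. It assumes a tabular softmax policy with parameters `theta[state, action]`; the function names and the toy trajectory are hypothetical, chosen only for illustration. The per-trajectory version multiplies the summed score by the total reward, while the per-timestep version weights each step's score by the discounted return from that step onward.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s) w.r.t. theta for a tabular softmax policy."""
    g = np.zeros_like(theta)
    g[s] = -softmax(theta[s])   # -pi(.|s) on the row of the visited state
    g[s, a] += 1.0              # +1 on the action actually taken
    return g

def per_trajectory_grad(theta, states, actions, rewards):
    """(sum_t grad log pi(a_t|s_t)) * (sum_t r_t): one update per trajectory."""
    score = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
    return score * sum(rewards)

def per_timestep_grads(theta, states, actions, rewards, gamma=1.0):
    """gamma^t * G_t * grad log pi(a_t|s_t): one update term per timestep."""
    returns, G = [], 0.0
    for r in reversed(rewards):  # accumulate discounted return backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return [(gamma ** t) * returns[t] * grad_log_pi(theta, s, a)
            for t, (s, a) in enumerate(zip(states, actions))]

# Toy trajectory (made-up numbers): 2 states, 2 actions, uniform initial policy.
theta = np.zeros((2, 2))
states, actions, rewards = [0, 1], [0, 1], [1.0, 2.0]

g_traj = per_trajectory_grad(theta, states, actions, rewards)
g_step = sum(per_timestep_grads(theta, states, actions, rewards))
print(g_traj)
print(g_step)
```

On this toy trajectory the two estimates agree on the row for the first state and differ on the row for the second: the per-trajectory form also credits the second action with the reward collected before it was taken, while the per-timestep form only uses the return from that step onward.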

Both formulations have issues. Sutton's equation can have convergence issues because of the correlation between actions in the same trajectory. The second one has awful sample efficiency. So we should see both as preliminary steps that lead us to more advanced algorithms like PPO.
