If you substitute ∇log π(τ)

into the expectation

it becomes
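As a sketch in the usual notation (an assumption here: a trajectory τ = (s_1, a_1, ..., s_T, a_T), a policy π_θ, and total reward r(τ) = Σ_t r(s_t, a_t)), the substitution can be written out as follows. The first line holds because the initial-state and transition probabilities do not depend on θ, so only the policy terms survive the gradient.

```latex
% Score of a whole trajectory: only the policy terms depend on \theta
\nabla_\theta \log \pi_\theta(\tau) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)

% The expectation being substituted into
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau)\, r(\tau) \right]

% Result of the substitution
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) \left( \sum_{t=1}^{T} r(s_t, a_t) \right) \right]
```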

That said, many people ask why this equation differs from the one in Richard Sutton's book (which also appears in David Silver's slides).

This is because Sutton and Silver start by formulating the problem as follows:

which leads to a per-timestep update.

Professor Levine formulates the problem as follows:

which leads to a per-trajectory update. I went back to the original Williams (1992) paper to find out what the original intention was. I did not get a clear answer, and I don't want to speak for what Williams had in mind. But that is not very important: we are not historians, and what we care about is which one to use.
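To make the contrast between the two updates concrete, here is a minimal sketch in Python with NumPy. It assumes a toy tabular softmax policy whose parameters are a table of logits `theta[state, action]`; the function names (`grad_log_pi`, `per_timestep_update`, `per_trajectory_update`) and the default step size are hypothetical and chosen only for illustration. The per-timestep version follows the REINFORCE update in Sutton's book (one parameter step per timestep, weighted by the return from that step onward), while the per-trajectory version averages the product of the summed score and the total reward over sampled trajectories before taking a single step.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(theta, s, a):
    # Gradient of log pi(a | s) w.r.t. theta for a tabular softmax policy.
    # Only row s is non-zero: one_hot(a) - pi(. | s).
    g = np.zeros_like(theta)
    g[s] = -softmax(theta[s])
    g[s, a] += 1.0
    return g

def per_timestep_update(theta, states, actions, rewards, alpha=0.1, gamma=1.0):
    # Sutton-style REINFORCE: update theta once per timestep,
    # weighting the score at time t by the return-to-go G_t.
    theta = theta.copy()
    T = len(rewards)
    for t in range(T):
        G_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        theta += alpha * (gamma ** t) * G_t * grad_log_pi(theta, states[t], actions[t])
    return theta

def per_trajectory_update(theta, trajectories, alpha=0.1):
    # Trajectory-level estimator: for each trajectory, multiply the summed
    # score by the total reward, average over trajectories, then take one step.
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        score = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
        grad += score * sum(rewards)
    return theta + alpha * grad / len(trajectories)

# Toy usage: 2 states, 2 actions, one hand-written trajectory.
theta0 = np.zeros((2, 2))
traj = ([0, 1], [0, 1], [1.0, 2.0])  # (states, actions, rewards)
theta_step = per_timestep_update(theta0, *traj)
theta_traj = per_trajectory_update(theta0, [traj])
```

On this toy trajectory the two rules already disagree on the second state: the per-timestep rule weights that action by the reward collected from that step onward, while the per-trajectory rule weights every action by the reward of the whole trajectory.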

Both formulations have issues. Sutton's equation can have convergence issues because of the correlation between actions in the same trajectory. The second one is awful in terms of sample efficiency. So we should see both as preliminary steps that lead us to more advanced algorithms such as PPO.