The REINFORCE algorithm originates from Williams' 1992 paper. I do think the paper leaves some room on when to update the policy. Silver and Sutton update the policy at every timestep. The UC Berkeley course, from my understanding, updates the policy after a full trajectory. In this situation, and on other topics, I try not to trace exactly what happened historically. I try not to be a historian. :-) Research papers are often written to cover as many possibilities as they can. In this particular case, I present the UC Berkeley version, which I think may be more sensible.
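To make the difference between the two update schedules concrete, here is a minimal sketch in Python with NumPy. It assumes a tabular softmax policy with logits `theta[state, action]` and one already-collected trajectory. The function names, the step size `alpha`, and the discount `gamma` are my own choices for illustration, and the code does not reproduce either source exactly.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def grad_log_pi(theta, s, a):
    # Gradient of log pi(a|s) w.r.t. theta for a tabular softmax policy
    # (assumed parameterization: theta[s] holds the logits for state s).
    g = np.zeros_like(theta)
    g[s] = -softmax(theta[s])
    g[s, a] += 1.0
    return g

def returns_to_go(rewards, gamma):
    # Discounted return from each timestep to the end of the trajectory.
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def update_per_trajectory(theta, states, actions, rewards, alpha=0.1, gamma=0.99):
    # Accumulate the gradient over the whole trajectory with theta held fixed,
    # then apply a single update.
    G = returns_to_go(rewards, gamma)
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, G):
        grad += g * grad_log_pi(theta, s, a)
    return theta + alpha * grad

def update_per_timestep(theta, states, actions, rewards, alpha=0.1, gamma=0.99):
    # Apply one update per timestep, so each later gradient is evaluated
    # at the already-updated theta.
    G = returns_to_go(rewards, gamma)
    theta = theta.copy()
    for s, a, g in zip(states, actions, G):
        theta += alpha * g * grad_log_pi(theta, s, a)
    return theta

if __name__ == "__main__":
    theta0 = np.zeros((2, 2))                     # 2 states, 2 actions
    states, actions, rewards = [0, 0], [0, 1], [1.0, 1.0]
    print(update_per_trajectory(theta0, states, actions, rewards, gamma=1.0))
    print(update_per_timestep(theta0, states, actions, rewards, gamma=1.0))
```

Both variants here wait until the trajectory ends so the returns are known. The only difference is whether the gradients are all evaluated at the same parameters or at parameters that shift after every step. With a small step size the two results stay close, which is part of why I do not worry much about which one is the "original".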
