θ is updated only once per trajectory τ, i.e. once per full episode.

BTW, Policy Gradient with a Monte Carlo rollout (REINFORCE) updates θ once per episode. There are other ways to estimate the expected reward.
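A minimal sketch of this once-per-episode update, assuming a toy corridor environment and a tabular softmax policy (both invented here for illustration): a full trajectory τ is sampled first, the Monte Carlo returns G are computed from it, and only then is θ updated, a single time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2          # toy corridor: actions 0=left, 1=right, goal at rightmost state
theta = np.zeros((n_states, n_actions))  # policy parameters, one row per state

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_episode(theta, max_steps=50):
    """Monte Carlo rollout: sample one complete trajectory tau under the current policy."""
    s, traj = 0, []
    for _ in range(max_steps):
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        traj.append((s, a, r))
        if s_next == n_states - 1:   # reached the goal: episode ends
            break
        s = s_next
    return traj

alpha, gamma = 0.1, 0.99
for episode in range(500):
    traj = run_episode(theta)        # one whole trajectory tau
    # Accumulate the REINFORCE gradient over the full episode ...
    G, grad = 0.0, np.zeros_like(theta)
    for s, a, r in reversed(traj):
        G = r + gamma * G            # Monte Carlo return from this step onward
        probs = softmax(theta[s])
        g = -probs                   # grad of log pi(a|s) for a softmax policy:
        g[a] += 1.0                  # one_hot(a) - probs
        grad[s] += G * g
    theta += alpha * grad            # ... then update theta ONCE per episode
```

After training, the policy should prefer moving right (toward the goal), and the key structural point is visible in the loop: no matter how long the trajectory is, `theta` receives exactly one gradient step per episode.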