Jonathan Hui·FollowJan 19, 2023--ListenSharethe new policy is the one estimated by the DNN. it is trained with gradient descent similar to the policy gradient.