RL — Guided Policy Search (GPS)

Motivation of GPS

It is easy to take the tasks shown in the video for granted. Supervised learning, policy gradients, and imitation learning all depend heavily on sampling to train a deep network, and the time this takes is daunting. What challenges do we face in applying reinforcement learning (RL) to robotic control?

  • Robotic control uses a camera to observe the external environment. Inferring actions from such high-dimensional, entangled data is hard.
  • Executing a partially learned, and potentially bad, policy puts the robot in harm’s way.
  • Initial guesses for the solution may be ill-conditioned.

Guided Policy Search (GPS)

Model-based RL plans optimal actions based on the cost function and the dynamics model f.
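As a quick sketch (written here in standard trajectory-optimization notation, since the original equation images are missing), the planning problem that model-based RL solves has the form:

```
min over u_1, ..., u_T of  Σ_t c(x_t, u_t)
subject to                 x_{t+1} = f(x_t, u_t)
```

where c is the cost function and f is the learned or known dynamics model.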


Dual Gradient Descent (DGD)

DGD optimizes an objective under a constraint C:
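The original equation image is missing; in standard notation, DGD forms the Lagrangian L(x, λ) = f(x) + λ·C(x), then alternates between minimizing L over x and taking a gradient-ascent step on λ. Below is a toy, self-contained sketch on a hypothetical problem (minimize x² subject to x − 1 = 0), not an example from the article:

```python
# Dual gradient descent on a toy problem (illustrative, not from the article):
# minimize f(x) = x^2  subject to  C(x) = x - 1 = 0.
# Lagrangian: L(x, lam) = x^2 + lam * (x - 1).

def dual_gradient_descent(alpha=0.5, steps=200):
    lam = 0.0
    for _ in range(steps):
        # Step 1: minimize the Lagrangian over x.
        # Closed form here: dL/dx = 2x + lam = 0  =>  x = -lam/2.
        x = -lam / 2.0
        # Step 2: gradient ascent on the dual variable: dL/dlam = C(x).
        lam += alpha * (x - 1.0)
    return x, lam

x, lam = dual_gradient_descent()
print(round(x, 4), round(lam, 4))  # prints: 1.0 -2.0
```

The iteration drives the constraint violation C(x) to zero: x converges to 1 and the multiplier λ to −2.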

Deterministic GPS

Let’s walk through an example with a deterministic policy. First, we simplify the notation a little and write the objective as,
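The original equation image is missing; with standard GPS notation, the constrained objective typically takes the form:

```
min over τ, θ of  c(τ)
subject to        u_t = π_θ(x_t)  for all t
```

where τ = (x_1, u_1, ..., x_T, u_T) is the trajectory produced by the controller and π_θ is the policy network.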



A controller determines what actions to take. We run the controller on a robot and record the observed trajectories. We use them as expert demonstrations to train a policy. This policy serves the same objective as the controller, but it is modeled with a deep network, while the controller uses a trajectory-optimization method to determine actions.
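This alternation can be sketched in a few lines. Everything below is a toy stand-in: the "controller" is a hand-coded feedback law rather than a real trajectory optimizer such as iLQR, the dynamics are one-dimensional, and the "policy" is a single-parameter linear model fit by least squares instead of a deep network:

```python
# Minimal sketch of the deterministic GPS loop (all names illustrative):
# 1) run a trajectory-optimization controller to get (state, action) pairs,
# 2) fit the policy to imitate the controller with supervised learning.
import numpy as np

def controller(x):
    # Stand-in for a trajectory optimizer (e.g. iLQR): a simple feedback law.
    return -0.5 * x

def rollout(x0, horizon=20):
    states, actions = [], []
    x = x0
    for _ in range(horizon):
        u = controller(x)
        states.append(x)
        actions.append(u)
        x = x + u  # toy dynamics f(x, u) = x + u
    return np.array(states), np.array(actions)

# Collect "expert" demonstrations from the controller.
X, U = rollout(x0=4.0)

# Supervised learning step: a 1-parameter linear policy u = k * x,
# fit by least squares (a deep network would be trained the same way).
k = float(np.sum(X * U) / np.sum(X * X))
print(round(k, 3))  # prints: -0.5  (the policy recovers the controller's gain)
```

In real GPS the supervised step trains the network on the controller's states and actions, and the process iterates: re-optimize trajectories, re-fit the policy.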

Imitate optimal control

To plan trajectories accurately, we need to measure all the relevant states precisely. During training, we can deploy extra sensors and set up a controlled environment to collect the needed states, in particular for the external environment. Under this setup, it is not hard to plan the optimal controls. But how can we create a solution that does not require such heavy lifting at deployment? For example, we can train a self-driving car with Lidar and video cameras. But can we skip the Lidar at deployment?


Stochastic GPS

Previously, the controller and the policy were deterministic. How can we use GPS with a stochastic controller and policy? It turns out this requires only small changes.
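As a sketch in standard notation (the original equation images are missing): the controller becomes a distribution p(u_t | x_t) and the policy becomes π_θ(u_t | x_t), and the matching constraint is stated between distributions rather than actions:

```
min over p, θ of  E_{p(τ)}[ c(τ) ]
subject to        p(u_t | x_t) = π_θ(u_t | x_t)  for all t
```

In practice this equality constraint is commonly relaxed, for example to a bound on the KL divergence between the controller's and the policy's action distributions.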


Augmented Lagrangian

Next, we will look into a few methods to improve GPS. We add an extra quadratic penalty on the divergence between the controller and the policy. The new Lagrangian becomes
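The original equation image is missing; writing the controller–policy mismatch as a constraint C(x) = 0, the augmented Lagrangian has the standard form:

```
L̄(x, λ) = f(x) + λ·C(x) + ρ‖C(x)‖²
```

with penalty weight ρ. In GPS, C is the per-time-step divergence between the controller's action u_t and the policy's output π_θ(x_t).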

Multiple trajectories

We want to optimize multiple trajectories in parallel instead of a single trajectory. The benefit is similar to that of mini-batches.
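A minimal sketch of the idea, under the same toy assumptions as before (a hand-coded stand-in controller, 1-D dynamics, a linear policy instead of a network): optimize several trajectories from different initial states, then pool all their (state, action) pairs into one supervised fit, much like a mini-batch:

```python
# Sketch: fit one shared policy to samples pooled from several
# optimized trajectories (toy stand-ins, not the article's setup).
import numpy as np

def controller(x):
    return -0.5 * x  # stand-in for a per-trajectory optimizer

def rollout(x0, horizon=10):
    xs, us = [], []
    x = x0
    for _ in range(horizon):
        u = controller(x)
        xs.append(x)
        us.append(u)
        x = x + u  # toy dynamics
    return xs, us

# Optimize several trajectories in parallel (different initial states),
# then pool all (state, action) pairs -- analogous to a mini-batch.
starts = [-3.0, 1.0, 4.0]
X, U = [], []
for x0 in starts:
    xs, us = rollout(x0)
    X += xs
    U += us

X, U = np.array(X), np.array(U)
k = float(np.sum(X * U) / np.sum(X * X))  # one shared linear policy u = k*x
print(round(k, 3))  # prints: -0.5
```

Each trajectory can be optimized independently, but the supervised step trains a single policy on the pooled samples, which stabilizes the fit just as larger mini-batches do.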



This article was written a few years ago. I have not reviewed this lately. Please put messages in the comment section for any needed updates.



