RL — Guided Policy Search (GPS)

Motivation of GPS

It is easy to take the tasks shown in the video for granted. Supervised learning, policy gradient methods, and imitation learning all depend heavily on sampling to train a deep network, and collecting that many samples on a physical robot takes a daunting amount of time. What challenges do we face in reinforcement learning (RL) for robotic control?

  • Robotic control often relies on a camera to observe the external environment. Inferring actions from such high-dimensional, entangled data is hard.
  • Executing a partially learned, potentially bad policy puts the robot in harm’s way.
  • Potential guesses may be ill-conditioned.
Modified from source

Guided Policy Search (GPS)

Model-based RL plans optimal actions based on the cost function and the model f.
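To make planning with a known model concrete, here is a minimal sketch. It assumes linear dynamics x' = A x + B u and a quadratic cost, which is the classic finite-horizon LQR setting; the double-integrator matrices and cost weights are hypothetical values chosen for illustration, not taken from the article.

```python
import numpy as np

def lqr_gains(A, B, Q, R, horizon):
    """Backward Riccati recursion for finite-horizon LQR.

    Returns gains K_0..K_{T-1}; the planned action is u_t = -K_t x_t.
    Assumes dynamics x_{t+1} = A x_t + B u_t and stage cost
    x^T Q x + u^T R u, with terminal cost x_T^T Q x_T.
    """
    P = Q.copy()                      # cost-to-go matrix at the final step
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]                # reorder so gains[t] belongs to step t

def rollout(A, B, gains, x0):
    """Simulate the model f(x, u) = A x + B u under the planned gains."""
    xs, us = [x0], []
    for K in gains:
        u = -K @ xs[-1]
        us.append(u)
        xs.append(A @ xs[-1] + B @ u)
    return np.array(xs), np.array(us)

# Hypothetical double integrator: state = (position, velocity).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)                         # penalize distance from the origin
R = 0.1 * np.eye(1)                   # penalize control effort

gains = lqr_gains(A, B, Q, R, horizon=50)
xs, us = rollout(A, B, gains, np.array([1.0, 0.0]))
```

With a nonlinear model, methods such as iLQR repeat this kind of backward pass around a linearization of f; the sketch above only covers the linear case.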

Modified from source

Dual Gradient Descent (DGD)

DGD optimizes an objective under a constraint C:
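In its standard textbook form, the problem is to minimize f(x) subject to C(x) = 0. We form the Lagrangian L(x, λ) = f(x) + λ C(x), find x* = argmin L(x, λ) for the current λ, and then take a gradient ascent step on λ using the dual gradient C(x*). The sketch below runs this loop on a toy problem; the functions f and C, the step sizes, and the iteration counts are all assumptions made for illustration.

```python
import numpy as np

# Toy problem (hypothetical): minimize x1^2 + x2^2 subject to x1 + x2 = 1.
def grad_f(x):
    return 2.0 * x

def C(x):
    return float(x.sum() - 1.0)

def grad_C(x):
    return np.ones_like(x)

def dual_gradient_descent(x0, lam=0.0, alpha=0.5,
                          outer_steps=50, inner_steps=200, inner_lr=0.1):
    x = np.asarray(x0, dtype=float)
    for _ in range(outer_steps):
        # Step 1: x* = argmin_x L(x, lam), here by plain gradient descent.
        for _ in range(inner_steps):
            x = x - inner_lr * (grad_f(x) + lam * grad_C(x))
        # Step 2: gradient ascent on the dual; its gradient w.r.t. lam is C(x*).
        lam = lam + alpha * C(x)
    return x, lam

x_opt, lam_opt = dual_gradient_descent(np.zeros(2))
```

On this toy problem the loop settles at x = (0.5, 0.5), the closest point to the origin on the constraint line. In GPS, the inner minimization is replaced by trajectory optimization and policy training rather than plain gradient descent.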

Deterministic GPS

Let’s walk through an example with a deterministic policy. First, we simplify the notation a little. When we write the objective as,



A controller determines what actions to take. We run the controller on a robot and observe the trajectories. We use them as expert demonstrations to train a policy. This policy serves the same objective as the controller. But it is modeled with a deep network while the controller uses a trajectory optimization method to determine actions.
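The supervised step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the controller is a hypothetical linear feedback law u = -K x standing in for a trajectory optimizer, and the policy is a linear map fitted by least squares standing in for the deep network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical system and controller (illustration only).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
K = np.array([[2.0, 3.0]])            # assumed stabilizing feedback gain

def controller(x):
    return -K @ x

def collect_demonstrations(n_traj, horizon):
    """Run the controller and record (state, action) pairs."""
    states, actions = [], []
    for _ in range(n_traj):
        x = rng.normal(size=2)        # random initial state
        for _ in range(horizon):
            u = controller(x)
            states.append(x)
            actions.append(u)
            x = A @ x + B @ u
    return np.array(states), np.array(actions)

states, actions = collect_demonstrations(n_traj=20, horizon=30)

# Supervised learning: fit the policy u = W^T x to the demonstrations.
W, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(x):
    return W.T @ x
```

Because both the controller and the policy are linear here, the fit recovers the controller exactly. With a deep network and a nonlinear controller the match is only approximate, which is why GPS also constrains the controller to stay close to what the policy can reproduce.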

Imitate optimal control

To plan trajectories accurately, we need to measure all the relevant states precisely. During training, we can deploy extra sensors and set up a controlled environment to collect those states, particularly the ones describing the external environment. Under such a setup, it is not hard to plan the optimal controls. But how can we create a solution that does not require such heavy lifting at deployment? For example, we can train a self-driving car with both Lidar and video cameras. But can we skip the Lidar at deployment?


Stochastic GPS

Previously, the controller and the policy were deterministic. How can we use GPS with a stochastic controller and policy? Actually, this requires only very small changes.
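One common way to compare a stochastic controller with a stochastic policy is the KL divergence between their action distributions at a given state. The sketch below assumes both are Gaussian and uses the closed-form KL between two Gaussians; the gain K, the policy weights W, and the covariances are hypothetical values, and the article's exact constraint may differ.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in closed form."""
    k = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - k
                  + logdet1 - logdet0)

# Hypothetical linear-Gaussian controller p(u|x) and policy pi(u|x).
x = np.array([1.0, 0.0])
K = np.array([[2.0, 3.0]])            # controller mean is -K x
W = np.array([[-1.5, -3.0]])          # policy mean is W x
cov_p = np.array([[0.1]])             # controller action covariance
cov_pi = np.array([[0.2]])            # policy action covariance

kl = gaussian_kl(-K @ x, cov_p, W @ x, cov_pi)
```

The divergence is zero only when the two distributions agree in both mean and covariance, so it plays the role that the equality between controller action and policy action plays in the deterministic case.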


Augmented Lagrangian

Next, we will look into a few methods to improve GPS. We add an additional quadratic penalty for the divergence between the controller and the policy. The new Lagrangian becomes
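To show the effect of the quadratic penalty in isolation, here is a minimal sketch on a toy constrained problem. It assumes the standard method-of-multipliers form L(x, λ) = f(x) + λ C(x) + (ρ/2) C(x)², with the update λ ← λ + ρ C(x*). The toy functions, ρ, and the step sizes are assumptions for illustration; in GPS the constraint would measure the gap between the controller and the policy.

```python
import numpy as np

# Toy problem (hypothetical): minimize x1^2 + x2^2 subject to x1 + x2 = 1.
def grad_f(x):
    return 2.0 * x

def C(x):
    return float(x.sum() - 1.0)

def grad_C(x):
    return np.ones_like(x)

def augmented_lagrangian(x0, lam=0.0, rho=1.0,
                         outer_steps=50, inner_steps=200, inner_lr=0.1):
    x = np.asarray(x0, dtype=float)
    violations = []
    for _ in range(outer_steps):
        # Minimize f(x) + lam * C(x) + (rho / 2) * C(x)^2 over x.
        for _ in range(inner_steps):
            grad = grad_f(x) + (lam + rho * C(x)) * grad_C(x)
            x = x - inner_lr * grad
        violations.append(abs(C(x)))
        # Multiplier update uses the penalty weight as the step size.
        lam = lam + rho * C(x)
    return x, lam, violations

x_opt, lam_opt, violations = augmented_lagrangian(np.zeros(2))
```

The penalty term pulls x toward the constraint even while λ is still far from its optimal value, which tends to make the outer loop better behaved than plain dual gradient descent.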

Multiple trajectories

We want to optimize multiple trajectories in parallel instead of a single trajectory. The benefit is similar to that of mini-batch training.



This article was written a few years ago. I have not reviewed this lately. Please put messages in the comment section for any needed updates.




Jonathan Hui
