RL — Guided Policy Search (GPS)

Jonathan Hui
10 min read · Jan 18, 2022


With Guided Policy Search (GPS), a robot learns each skill in the video in about 20 minutes. Trained with Policy Gradient methods, the same skills would take weeks. The demonstration also shows that the robot can handle scenarios it was not trained on.

In RL, success is measured by the robustness of the solution and how well it generalizes.

In this article, we discuss GPS, which takes advantage of the sample efficiency of model-based learning while producing a policy that generalizes better.

Motivation of GPS

The tasks shown in the video are easy to take for granted. Supervised learning, Policy Gradient, and Imitation learning all depend heavily on sampling to train a deep network, and the time this takes is daunting. What challenges do we face when applying reinforcement learning (RL) to robotic control?

  • For many RL methods, running the physical experiments needed for training takes weeks.
  • Robotic control uses a camera to observe the external environment. Inferring actions from high-dimensional, entangled data is hard.
  • Executing a partially learned, potentially bad policy puts the robot in harm’s way.
  • Potential guesses may be ill-conditioned.
  • Policy drift may lead us to states that we have never trained on.

Model focus or Policy focus

In Model-based RL, we plan a path to reach a goal (destination) with the lowest cost. Model-based RL has the strong advantage of being sample efficient. For example, if it is appropriate to approximate the dynamics as locally linear, it takes far fewer samples to learn the model (the system dynamics). Once the model and the cost function are known, we can plan the optimal actions without further sampling. But be warned: to compensate for the fewer samples, the model needs to be more accurate, and the optimization is computationally intense.

Many Policy-based methods search the solution space in a way that resembles trial and error. Even though the search is prioritized, it may be impractical for robotic research because the physical experiments take too much time. This is more acceptable if the robot can train itself without supervision.

Is learning a policy easier than a model? Which one is easier to generalize? That strongly depends on the tasks. For some tasks, the policy may be simpler, at least for a human. Most people have a decent policy to balance a cart-pole which can be generalized to other balancing acts. But figuring out the exact dynamics will be hard.

Model-based RL can optimize a trajectory for one scenario well, but the result may not generalize. For some tasks, a policy is easier to learn and to generalize, but policy-based methods are not efficient at collecting good samples.

RL methods are rarely mutually exclusive. Can we combine both model and policy learning for good sample efficiency and generalization?

Guided Policy Search GPS

Model-based RL plans optimal actions based on the cost function and the model f.

We run a controller p(u|x) on the robot. This controller decides the action to be taken at a state. We execute the actions and collect the trajectory information.

Initially, p can be a random controller taking random actions as long as it is safe.


We use the collected trajectory to fit the model and use iLQR or other trajectory optimization methods to improve the controller.
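As a minimal sketch of the "fit the model" step, assume the dynamics are approximated by a single linear model x' ≈ A x + B u + c and fitted by least squares. The function name `fit_linear_dynamics` and the toy system below are illustrative assumptions, not something taken from the GPS work itself; a real implementation would typically fit time-varying local models.

```python
import numpy as np

def fit_linear_dynamics(X, U, X_next):
    """Least-squares fit of x' ~ A x + B u + c from sampled transitions."""
    n, dx = X.shape
    du = U.shape[1]
    # Stack [x, u, 1] so one solve recovers A, B and the offset c.
    Z = np.hstack([X, U, np.ones((n, 1))])
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)  # shape (dx + du + 1, dx)
    A = W[:dx].T
    B = W[dx:dx + du].T
    c = W[-1]
    return A, B, c

# Hypothetical ground-truth system, used only to generate sample transitions.
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])
X = rng.normal(size=(200, 2))
U = rng.normal(size=(200, 1))  # actions from a random (exploratory) controller
X_next = X @ A_true.T + U @ B_true.T + 0.01 * rng.normal(size=(200, 2))

A_hat, B_hat, c_hat = fit_linear_dynamics(X, U, X_next)
```

With a couple of hundred noisy transitions, `A_hat` and `B_hat` land close to the true matrices, which illustrates why a locally linear model needs relatively few samples.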

We run the controller again and refine it iteratively.

Guided Policy Search learns and uses such a model for planning. But the real intention is training a policy.


In GPS, we use the sampled trajectories to train a policy using supervised learning.

We also add a new constraint to the optimization problem: we want the actions taken to match the policy.

This constrained optimization can be solved using Dual Gradient Descent (DGD). Let’s take a short detour to introduce DGD.

Dual Gradient Descent DGD

DGD optimizes an objective under a constraint C:

The key idea is transforming the equation into a Lagrange dual function which can be optimized iteratively. First, we define the corresponding Lagrangian 𝓛 and the Lagrange dual function g as:
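In standard notation, these definitions can be written as follows. The symbols f and x and the equality form C(x) = 0 are assumptions made here for concreteness.

```latex
\min_{x} f(x) \quad \text{s.t.} \quad C(x) = 0

\mathcal{L}(x, \lambda) = f(x) + \lambda \, C(x)

g(\lambda) = \min_{x} \mathcal{L}(x, \lambda) = \mathcal{L}\big(x^{*}(\lambda), \lambda\big)
```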

The dual function g is a lower bound for the original optimization problem. If the original problem is convex and strong duality holds (which is typically the case), the maximum value of g equals the minimum value of the original problem. Hence, the optimization problem is translated into finding the λ that maximizes g.

So we start with a random guess of λ. We minimize 𝓛 with respect to x in step 1 below. In steps 2 & 3, we update λ using gradient ascent to maximize g. We alternate between minimizing the Lagrangian 𝓛 and taking an ascent step on g until g converges. At convergence, the maximizing λ and the corresponding x give the optimal solution.
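To make the three steps concrete, here is a minimal sketch of DGD on a toy problem chosen for this illustration: minimize x1² + x2² subject to x1 + x2 − 1 = 0. The problem, the step size `alpha`, and the starting value of λ are all assumptions; the inner minimization is available in closed form only because the toy objective is quadratic.

```python
import numpy as np

def C(x):
    """Constraint, written so that feasibility means C(x) == 0."""
    return x[0] + x[1] - 1.0

def argmin_lagrangian(lam):
    # L(x, lam) = x1^2 + x2^2 + lam * (x1 + x2 - 1)
    # dL/dx_i = 2 x_i + lam = 0  =>  x_i = -lam / 2
    return np.array([-lam / 2.0, -lam / 2.0])

lam = 0.0    # initial guess (any starting value works for this problem)
alpha = 0.5  # gradient-ascent step size
for _ in range(100):
    x = argmin_lagrangian(lam)  # step 1: x* = argmin_x L(x, lam)
    grad = C(x)                 # step 2: dg/dlam = dL/dlam at x* = C(x*)
    lam += alpha * grad         # step 3: gradient ascent on g
```

The loop converges to x = (0.5, 0.5) and λ = −1, which is the constrained minimum of this toy problem.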

Let’s use DGD to solve our optimization problem. The algorithm becomes:

Steps 1 & 2 above are the original step 1. We just split x into τ and θ which can be optimized independently. And we consolidate the original Gradient ascent steps into step 3 above.

Deterministic GPS

Let’s get through an example with a deterministic policy. First, we are going to simplify the notation a little bit. When we write the objective as,

We really mean:

This just makes the equation more readable. Our objective will be written as:

The algorithm


will be rewritten with some new terms:


In step 1, we add the constraint

to the Lagrangian as:

In step 2, c(τ) is independent of θ and can be removed.

This is simply supervised learning using the sampled trajectories:
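A sketch of the deterministic case in LaTeX, assuming the shorthand τ = (x_1, u_1, …, x_T, u_T) and c(τ) = Σ_t c(x_t, u_t), and omitting the dynamics constraint as the text does at this point. The exact layout of the original equations is not recoverable here, so treat this as one consistent way to write them.

```latex
\min_{\tau, \theta} \; c(\tau) \quad \text{s.t.} \quad u_t = \pi_\theta(x_t) \;\; \forall t

\mathcal{L}(\tau, \theta, \lambda) = c(\tau) + \sum_{t} \lambda_t^{\top} \big( \pi_\theta(x_t) - u_t \big)

\text{Step 1:} \quad \tau \leftarrow \arg\min_{\tau} \; c(\tau) + \sum_{t} \lambda_t^{\top} \big( \pi_\theta(x_t) - u_t \big)

\text{Step 2:} \quad \theta \leftarrow \arg\min_{\theta} \; \sum_{t} \lambda_t^{\top} \big( \pi_\theta(x_t) - u_t \big)

\text{Step 3:} \quad \lambda_t \leftarrow \lambda_t + \alpha \big( \pi_\theta(x_t) - u_t \big)
```

In step 2 the term c(τ) has been dropped because it does not depend on θ. What remains fits π_θ to the actions u_t along the sampled trajectories.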

If we put back the model constraint, the algorithm will look like:

In step 1, we will use a trajectory optimization method like iLQR to find the optimal control. In step 2, we use the Gradient descent method to learn the policy.
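For intuition about step 1, consider the special case of linear dynamics and a quadratic cost, where trajectory optimization reduces to the classic finite-horizon LQR backward pass (iLQR repeats a similar pass around successive linearizations). The sketch below is plain LQR under those assumptions, not a full iLQR and not the GPS implementation; the scalar system is a made-up example.

```python
import numpy as np

def lqr_backward(A, B, Q, R, T):
    """Finite-horizon discrete LQR. Returns gains K_t with u_t = -K_t x_t."""
    P = Q.copy()  # terminal cost-to-go matrix
    gains = []
    for _ in range(T):
        # K = (R + B' P B)^-1 B' P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati recursion: P = Q + A' P (A - B K)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # gains[0] is the gain for t = 0

# Hypothetical scalar system x' = x + u with cost sum(x^2 + u^2).
A = np.array([[1.0]])
B = np.array([[1.0]])
Q = np.array([[1.0]])
R = np.array([[1.0]])
gains = lqr_backward(A, B, Q, R, T=50)
```

For this system the early gains settle at about 0.618, the steady-state solution of the scalar Riccati equation.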


A controller determines what actions to take. We run the controller on a robot and observe the trajectories. We use them as expert demonstrations to train a policy. This policy serves the same objective as the controller. But it is modeled with a deep network while the controller uses a trajectory optimization method to determine actions.

We start by executing a random controller on a robot. The observed trajectories are used to model the system dynamics. As we learn the system dynamics better, the controller improves as well. However, we want to limit how much the controller changes in each iteration. Big changes can lead to larger errors, and bad decisions jeopardize the training progress. In addition, if the action for the same state keeps changing, it destabilizes the policy learning. Hence, we add a new constraint.

In trajectory planning, this constraint penalizes a trajectory if the policy is different from the controller’s action. In practice, this limits the amount of change to the controller. This allows the trajectory to change gradually with some breathing room for the policy to learn.

Aren’t the controller and the policy doing the same thing?

The controller builds up an understanding of how the dynamics work and uses trajectory optimization to calculate the optimal controls. Even though these methods are complex, they can produce accurate results. However, if the model is complex, such a controller may be accurate only around the collected trajectories. It will perform badly in regions where the dynamics differ greatly from what it was trained on. In short, the controller can be close to optimal in specific regions, but the solution may not generalize well to others.

Many policy-based methods are essentially educated trial and error. They consume a lot of samples and spend a lot of effort searching the policy space. But if a policy is much simpler than a model, what it really needs is optimal trajectories to learn from. These can be provided by the controller. If the policy is really simple, it has fewer scenarios to deal with. With decent coverage from the trajectory samples, we should get a policy that generalizes well.

Imitate optimal control

To have accurate trajectory planning, we need to measure all the relevant states precisely. During training, we can deploy extra sensors and set up a controlled environment to collect those needed states, in particular for the external environments. Under these setups, it is not hard to plan the optimal controls. But how can we create a solution that does not require such heavy lifting in the deployment? For example, we can train a self-driving car with Lidar and video cameras. But can we skip the Lidar in the deployment?

If we cannot innovate, we imitate. In training, we have access to the needed states, so we can use trajectory planning to calculate the optimal controls. It is computationally intense, but it generates good trajectories. Those trajectories can be repurposed as labels to train another policy that sees only part of the state, observed through other means.

That is, we use supervised learning to train a second policy that takes different inputs.


In this process, we train two different models that take different inputs but produce the same policy, one that minimizes the trajectory cost.


Stochastic GPS

Previously, the controller and the policy were deterministic. How can we use GPS with a stochastic controller and policy? Actually, this requires only small changes.

Our objective is defined as:

where both p and π are stochastic. The Lagrangian is modified so that the difference between these two probability distributions is measured with the KL-divergence:

Then, we can use any trajectory optimization method to find the optimal τ.


If we compare it with Deterministic GPS, we can see that the difference is very small.


Time-varying Linear-Gaussian controller

Let’s look at one particular example that uses a Linear-Gaussian controller:

The action is Gaussian distributed, with a mean that is a linear function of the state and gains that vary over time. The dynamics model is also a Gaussian distribution with a linear mean.

For this controller, we can apply iLQR to find the optimal controls. The KL term in the optimization can be computed easily.
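In the notation commonly used for time-varying linear-Gaussian controllers, this can be written as below. The symbols K_t, k_t, C_t for the controller and f_{x_t}, f_{u_t}, F_t for the dynamics are assumptions made here, since the original equation is not available.

```latex
p(u_t \mid x_t) = \mathcal{N}\big( K_t x_t + k_t, \; C_t \big)

p(x_{t+1} \mid x_t, u_t) = \mathcal{N}\big( f_{x_t} x_t + f_{u_t} u_t, \; F_t \big)
```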

For example, the KL-divergence for two one-dimensional Gaussian distributions is:
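With p = N(μ₁, σ₁²) and q = N(μ₂, σ₂²) (symbols chosen here for illustration), the standard closed form is:

```latex
D_{\mathrm{KL}}(p \,\|\, q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^{2} + (\mu_1 - \mu_2)^{2}}{2 \sigma_2^{2}} - \frac{1}{2}
```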

The generic formula is:
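For two k-dimensional Gaussians, the standard closed form is ½ [tr(Σ₂⁻¹Σ₁) + (μ₂ − μ₁)ᵀ Σ₂⁻¹ (μ₂ − μ₁) − k + ln(det Σ₂ / det Σ₁)]. A small NumPy sketch of that formula follows; the function name `kl_gaussian` is chosen here for illustration.

```python
import numpy as np

def kl_gaussian(mu1, S1, mu2, S2):
    """KL( N(mu1, S1) || N(mu2, S2) ) for full-covariance Gaussians."""
    k = mu1.shape[0]
    S2_inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    # slogdet returns (sign, log|det|); covariances are positive definite.
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(S2_inv @ S1) + diff @ S2_inv @ diff - k
                  + logdet2 - logdet1)

# One-dimensional check: N(0, 1) against N(1, 2^2).
kl_1d = kl_gaussian(np.array([0.0]), np.array([[1.0]]),
                    np.array([1.0]), np.array([[4.0]]))
```

The one-dimensional case agrees with the scalar formula: log(2) + (1 + 1)/8 − 1/2.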

Augmented Lagrangian

Next, we will look into a few methods to improve GPS. We add a quadratic penalty on the divergence between the controller and the policy. The new Lagrangian becomes

We also add an extra term for the policy training:
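To see what the extra term does to policy training, here is a minimal sketch for the deterministic case. It assumes a linear policy π(x) = K x and a penalty of the form ρ‖π(x_t) − u_t‖² added to the multiplier term. The penalty form, the linear policy, and the name `fit_linear_policy` are assumptions made for this illustration. Under them, completing the square turns the policy step into ordinary least squares on shifted targets u_t − λ_t / (2ρ).

```python
import numpy as np

def fit_linear_policy(X, U, lam, rho):
    """Minimise sum_t lam_t . (K x_t - u_t) + rho * ||K x_t - u_t||^2 over K."""
    # Completing the square gives least squares on shifted targets.
    targets = U - lam / (2.0 * rho)
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)  # X @ W ~ targets
    return W.T  # K, so that pi(x) = K @ x

# Hypothetical data: states, controller actions, and current multipliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K_true = np.array([[1.0, -2.0, 0.5]])
U = X @ K_true.T
lam = 0.1 * rng.normal(size=(50, 1))
rho = 1.0

K = fit_linear_policy(X, U, lam, rho)
```

With only the linear multiplier term, this objective would be unbounded for a linear policy; the quadratic penalty is what makes the supervised step a well-posed regression.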

Multiple trajectories

We want to optimize multiple trajectories in parallel instead of a single trajectory. The benefit is similar to that of mini-batch training.



This article was written a few years ago. I have not reviewed this lately. Please put messages in the comment section for any needed updates.