
RL — Guided Policy Search (GPS)

Jan 18, 2022

With Guided Policy Search (GPS), a robot learns each skill in the video in 20 minutes. Trained with policy gradient methods instead, it would take weeks. The demonstration also shows that the robot can handle scenarios it has never been trained on.

In RL, success is measured by the robustness of the solution and how well it generalizes.

In this article, we discuss GPS, which takes advantage of the sample efficiency of model-based learning while producing a policy that generalizes better.

Motivation of GPS

We take the tasks shown in the video for granted. Supervised learning, policy gradients, and imitation learning depend heavily on sampling to train the deep network, which is daunting because of the time it takes. What challenges do we face in reinforcement learning (RL) for robotic control?

  • The physical simulation for many RL methods takes weeks of training.
  • Robotic control uses cameras to observe the external environment. Inferring actions from high-dimensional, entangled data is hard.
  • Executing a partially learned but potentially bad policy puts the robot in harm’s way.
  • Potential guesses may be ill-conditioned.
  • Policy drift may lead us to states that we have never trained on before.

Model focus or Policy focus


In model-based RL, we plan a path to reach a goal (destination) with the lowest cost. Model-based RL has the strong advantage of being sample efficient. For example, if it is appropriate to approximate the dynamics as locally linear, it takes far fewer samples to learn the model (the system dynamics). Once the model and the cost function are known, we can plan the optimal actions without further sampling. But be warned: to compensate for the fewer samples, the model needs to be more accurate, and the optimization is computationally intense.
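As a small illustration of this sample efficiency (a toy linear system of my own; the matrices and variable names are not from the article), fitting a locally linear model x_{t+1} ≈ A x_t + B u_t + c takes only a handful of transitions and a single least-squares solve:

```python
import numpy as np

# Toy ground-truth dynamics x_{t+1} = A x_t + B u_t + c, unknown to the learner.
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])
c_true = np.array([0.0, -0.05])

# Collect a small batch of random transitions (x_t, u_t, x_{t+1}).
X, U, X_next = [], [], []
x = np.zeros(2)
for _ in range(30):                                   # a few dozen samples are enough
    u = rng.normal(size=1)
    x_next = A_true @ x + B_true @ u + c_true + 0.01 * rng.normal(size=2)
    X.append(x); U.append(u); X_next.append(x_next)
    x = x_next

# Fit [A B c] jointly with one linear least-squares solve on the features [x, u, 1].
Phi = np.hstack([np.array(X), np.array(U), np.ones((len(X), 1))])
W, *_ = np.linalg.lstsq(Phi, np.array(X_next), rcond=None)
A_hat, B_hat, c_hat = W[:2].T, W[2:3].T, W[3]
print(np.round(A_hat, 2), np.round(B_hat, 2), np.round(c_hat, 2))
```

With the fitted model in hand, the optimal actions can then be planned without collecting more samples.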

Many policy-based methods search the solution space in a trial-and-error fashion. Even though the search is prioritized, this may be impractical for robotics research since the physical simulation takes too much time for a single experiment. It is more acceptable if the robot can train itself.

Is learning a policy easier than learning a model? Which one generalizes more easily? That depends strongly on the task. For some tasks, the policy may be simpler, at least for a human. Most people have a decent policy for balancing a cart-pole, and it generalizes to other balancing acts. But figuring out the exact dynamics would be hard.


Model-based RL can optimize a trajectory well for one scenario but may not generalize well. For some tasks, a policy is easier to learn and to generalize, but policy learning is not efficient at collecting good samples.

RL methods are rarely mutually exclusive. Can we combine both model and policy learning for good sample efficiency and generalization?

Guided Policy Search (GPS)

Model-based RL plans optimal actions based on the cost function c and the model f. In standard notation, the planning problem is:

\min_{u_1, \ldots, u_T} \sum_{t=1}^{T} c(x_t, u_t) \quad \text{s.t.} \quad x_{t+1} = f(x_t, u_t)

We run a controller p(u|x) on the robot. This controller decides the action to be taken at a state. We execute the actions and collect the trajectory information.


Initially, p can be a random controller taking random actions as long as it is safe.


We use the collected trajectories to fit the model and use iLQR or another trajectory optimization method to improve the controller.


We run the controller again and refine it iteratively.
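Here is a toy sketch of this loop (my own construction: a linear system with a quadratic cost, LQR standing in for iLQR, and made-up variable names). Each iteration runs the current controller, refits the dynamics from the collected transitions, and replans the controller with the fitted model:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(1)
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])      # true dynamics, unknown to the learner
B_true = np.array([[0.005], [0.1]])
Q, R = np.eye(2), np.eye(1)                      # quadratic cost x'Qx + u'Ru

def rollout(K, T=40, noise=0.1):
    """Run the controller u = -K x (plus exploration noise) and record transitions."""
    x, data = np.array([1.0, 0.0]), []
    for _ in range(T):
        u = -K @ x + noise * rng.normal(size=1)
        x_next = A_true @ x + B_true @ u
        data.append((x, u, x_next))
        x = x_next
    return data

K = np.zeros((1, 2))                             # start from a "do nothing" controller
for it in range(5):                              # iterate: collect -> fit model -> replan
    data = rollout(K)
    X = np.array([d[0] for d in data]); U = np.array([d[1] for d in data])
    Y = np.array([d[2] for d in data])
    W, *_ = np.linalg.lstsq(np.hstack([X, U]), Y, rcond=None)
    A_hat, B_hat = W[:2].T, W[2:].T              # fitted (locally) linear model
    P = solve_discrete_are(A_hat, B_hat, Q, R)   # plan with the fitted model (LQR)
    K = np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
    cost = sum(x @ Q @ x + u @ R @ u for x, u, _ in data)
    print(f"iteration {it}: rollout cost {cost:.2f}")
```

In GPS proper, the dynamics are refit around the latest trajectories and iLQR produces a time-varying controller, but the collect / fit / replan structure is the same.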

Guided Policy Search learns and uses such a model for planning, but its real intention is to train a policy.


In GPS, we use the sampled trajectories to train a policy with supervised learning.

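One common form of this supervised objective is a squared-error regression of the policy onto the controller’s actions at the sampled states:

\min_\theta \sum_{\tau} \sum_{t} \big\| \pi_\theta(x_t) - u_t \big\|^2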

We also add a new constraint to the optimization problem: the actions taken by the controller must match the policy.

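In the deterministic case, the constrained problem can be sketched as:

\min_{\tau, \theta} \; c(\tau) \quad \text{s.t.} \quad u_t = \pi_\theta(x_t) \;\; \text{for all } t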

This optimization can be solved using Dual Gradient Descent (DGD). Let’s take a short detour into DGD.

Dual Gradient Descent (DGD)

DGD optimizes an objective f under a constraint C:

\min_x f(x) \quad \text{s.t.} \quad C(x) = 0

The key idea is to transform the problem into a Lagrange dual function that can be optimized iteratively. First, we define the corresponding Lagrangian 𝓛 and the Lagrange dual function g as:

\mathcal{L}(x, \lambda) = f(x) + \lambda C(x), \qquad g(\lambda) = \min_x \mathcal{L}(x, \lambda)

The dual function g is a lower bound on the original optimization problem. If the original problem is convex, the maximum of g typically equals the minimum of the original problem (strong duality). Hence, the optimization problem is turned into finding the λ that maximizes g.

So we start with a random guess of λ. In step 1 below, we minimize 𝓛 over x. In steps 2 & 3, we update λ with gradient ascent to increase g. We alternate between minimizing the Lagrangian 𝓛 over x and maximizing g over λ until g converges. The converged solution is our answer.
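A minimal sketch of the three steps (α is a learning rate):

1. \; x^\star \leftarrow \arg\min_x \mathcal{L}(x, \lambda)
2. \; \dfrac{dg}{d\lambda} = \dfrac{d\mathcal{L}}{d\lambda}(x^\star, \lambda) = C(x^\star)
3. \; \lambda \leftarrow \lambda + \alpha \, \dfrac{dg}{d\lambda}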

Let’s use DGD to solve our optimization problem. The algorithm becomes:

1. \; \tau \leftarrow \arg\min_\tau \mathcal{L}(\tau, \theta, \lambda)
2. \; \theta \leftarrow \arg\min_\theta \mathcal{L}(\tau, \theta, \lambda)
3. \; \lambda \leftarrow \lambda + \alpha \, \dfrac{dg}{d\lambda}

Steps 1 & 2 above correspond to the original step 1: we simply split x into τ and θ, which can be optimized independently. The original gradient ascent steps are consolidated into step 3.
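As a quick numeric illustration (a toy problem of my own, not from the article), minimize f(x) = x² subject to C(x) = x - 1 = 0; DGD converges to x = 1 with λ = -2:

```python
# Toy dual gradient descent: minimize f(x) = x**2 subject to C(x) = x - 1 = 0.
# The optimum is x = 1 with Lagrange multiplier lambda = -2.
lam, alpha = 0.0, 0.5
for _ in range(50):
    x = -lam / 2.0        # step 1: argmin_x L(x, lam), where L = x**2 + lam * (x - 1)
    dg = x - 1.0          # step 2: dg/dlam = C(x)
    lam += alpha * dg     # step 3: gradient ascent on the dual
print(x, lam)             # approximately 1.0 and -2.0
```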

Deterministic GPS

Let’s go through an example with a deterministic policy. First, we simplify the notation a little. We write c(τ) for the total cost of the trajectory τ, and when we write the policy constraint as π_θ(x_t) = u_t, we really mean that it holds for every time step t. This just makes the equations more readable. Our objective is then written as:

\min_{\tau, \theta} \; c(\tau) \quad \text{s.t.} \quad \pi_\theta(x_t) = u_t

The algorithm above will be rewritten with some new terms.
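One way to write it out, following the step descriptions below, the deterministic version alternates:

1. \; \tau \leftarrow \arg\min_\tau \mathcal{L}(\tau, \theta, \lambda) \quad \text{(trajectory optimization, e.g. iLQR)}
2. \; \theta \leftarrow \arg\min_\theta \mathcal{L}(\tau, \theta, \lambda) \quad \text{(supervised policy learning by gradient descent)}
3. \; \lambda \leftarrow \lambda + \alpha \, \dfrac{dg}{d\lambda}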

In step 1, we add the constraint π_θ(x_t) = u_t to the Lagrangian:

\mathcal{L}(\tau, \theta, \lambda) = c(\tau) + \sum_t \lambda_t \big(\pi_\theta(x_t) - u_t\big)

In step 2, c(τ) is independent of θ and can be dropped, so we only minimize the constraint terms of the Lagrangian with respect to θ. This is simply supervised learning using the sampled trajectories: the policy π_θ is trained to reproduce the controller’s actions u_t at the visited states x_t.
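A minimal sketch of this step (my own toy setup: a linear policy u = K_pi x fitted by least squares stands in for the deep network trained by gradient descent):

```python
import numpy as np

# Pretend these pairs (x_t, u_t) came from rollouts of the optimized controller.
rng = np.random.default_rng(2)
states = rng.normal(size=(200, 4))                              # x_t
K_ctrl = rng.normal(size=(2, 4))                                # controller behind the demos
actions = states @ K_ctrl.T + 0.01 * rng.normal(size=(200, 2))  # u_t

# Supervised learning: fit a linear policy pi_theta(x) = K_pi @ x to the (x_t, u_t) pairs.
K_pi, *_ = np.linalg.lstsq(states, actions, rcond=None)
K_pi = K_pi.T
print(np.abs(K_pi - K_ctrl).max())    # near zero: the policy imitates the controller
```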

If we put back the model constraint x_{t+1} = f(x_t, u_t), the algorithm keeps the same structure; step 1 simply becomes a dynamics-constrained trajectory optimization.

In step 1, we use a trajectory optimization method like iLQR to find the optimal controls. In step 2, we use gradient descent to learn the policy.

Intuition

A controller determines what actions to take. We run the controller on a robot and observe the trajectories. We use them as expert demonstrations to train a policy. The policy serves the same objective as the controller, but it is modeled with a deep network, while the controller uses trajectory optimization to determine its actions.

We start by executing a random controller on a robot. The observed trajectories are used to model the system dynamics. As we learn the system dynamics better, the controller gets better too. However, we want to limit how much the controller changes in each iteration. Big changes potentially lead to larger errors, and bad decisions jeopardize the training progress. In addition, if the action for the same state keeps changing, policy learning is destabilized. Hence, we add a new constraint.


In trajectory planning, this constraint penalizes a trajectory if the policy differs from the controller’s action. In practice, this limits the amount of change to the controller, so the trajectory changes gradually and the policy has some breathing room to learn.

Aren’t the controller and the policy doing the same thing?

The controller builds up an understanding of how the dynamics work and uses trajectory optimization to calculate the optimal controls. Even though these methods are complex, they can produce accurate results. However, if the model is complex, such a controller may be accurate only around the collected trajectories. It will perform badly in regions where the dynamics differ from what it has been trained on. In short, the controller can be very close to optimal in specific areas, but the solution may not generalize well to others.

Many policy-based methods are just educated trial and error. They consume a lot of samples and spend a lot of effort searching the policy space. But if a policy is much simpler than a model, what it really needs is optimal trajectories to learn from, and these can be provided by the controller. If the policy is really simple, it has fewer scenarios to deal with. With decent coverage of the sampled trajectories, we should get a policy that generalizes well.

Imitate optimal control

To plan trajectories accurately, we need to measure all the relevant states precisely. During training, we can deploy extra sensors and set up a controlled environment to collect the needed states, in particular for the external environment. Under such a setup, it is not hard to plan the optimal controls. But how can we create a solution that does not require such heavy lifting at deployment? For example, we can train a self-driving car with Lidar and video cameras. But can we skip the Lidar in deployment?


If we cannot innovate, we imitate. In training, we have access to the needed states, so we can use trajectory planning to calculate the optimal controls. It is computationally intense, but it generates nice trajectories. Those trajectories can then be repurposed as labels to train another policy that uses only the part of the state observed by other means.


That is, we use supervised learning to train a second policy that takes different inputs.
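As a sketch (o_t stands for the observations available at deployment, such as camera images, and u_t* for the controls computed by trajectory optimization; both symbols are mine):

\min_\theta \sum_t \big\| \pi_\theta(o_t) - u_t^\star \big\|^2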


In this process, we train two different models that take different inputs but implement the same policy, the one that minimizes the trajectory cost.


Stochastic GPS

Previously, the controller and the policy were deterministic. How can we use GPS with a stochastic controller and policy? Actually, this requires only small changes.

Our objective is defined as:

\min_{p, \theta} \; E_{\tau \sim p(\tau)}\big[c(\tau)\big] \quad \text{s.t.} \quad p(u_t \mid x_t) = \pi_\theta(u_t \mid x_t)

where both p and π are stochastic. The Lagrangian is modified so that the difference between these two probability distributions is measured with the KL-divergence.

Then, we can use any trajectory optimization method to find the optimal τ.


If we compare it with deterministic GPS, the difference is very small.


Time-varying Linear-Gaussian controller

Let’s look into one particular example using a time-varying linear-Gaussian controller (writing K_t, k_t, and C_t for the time-varying gain, bias, and covariance):

p(u_t \mid x_t) = \mathcal{N}\big(K_t x_t + k_t, \; C_t\big)

The action is Gaussian distributed with a mean that is linear in the state. The dynamics are also modeled as a (locally linear) Gaussian:

p(x_{t+1} \mid x_t, u_t) = \mathcal{N}\big(A_t x_t + B_t u_t + c_t, \; F_t\big)

For this controller, we can apply iLQR to find the optimal controls, and the KL constraint in the optimization can be computed in closed form.

For example, the KL-divergence between two one-dimensional Gaussian distributions is:

D_{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

The generic formula for k-dimensional Gaussians is:

D_{KL}\big(\mathcal{N}(\mu_1, \Sigma_1) \,\|\, \mathcal{N}(\mu_2, \Sigma_2)\big) = \frac{1}{2}\Big(\operatorname{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) - k + \log\frac{\det \Sigma_2}{\det \Sigma_1}\Big)
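For reference, a small helper of my own implementing the general formula, checked against the one-dimensional case above:

```python
import numpy as np

def gaussian_kl(mu1, cov1, mu2, cov2):
    """KL( N(mu1, cov1) || N(mu2, cov2) ) for multivariate Gaussians."""
    k = mu1.shape[0]
    cov2_inv = np.linalg.inv(cov2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(cov2_inv @ cov1)
                  + diff @ cov2_inv @ diff
                  - k
                  + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))

# One-dimensional sanity check against the closed-form formula above.
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
kl_1d = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
print(kl_1d, gaussian_kl(np.array([mu1]), np.array([[s1**2]]),
                         np.array([mu2]), np.array([[s2**2]])))
```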

Augmented Lagrangian

Next, we will look into a few methods to improve GPS. First, we add an additional quadratic penalty on the divergence between the controller and the policy, as in the augmented Lagrangian method. In the deterministic case the new Lagrangian becomes (with ρ a penalty weight):

\bar{\mathcal{L}}(\tau, \theta, \lambda) = c(\tau) + \sum_t \lambda_t \big(\pi_\theta(x_t) - u_t\big) + \rho \sum_t \big\|\pi_\theta(x_t) - u_t\big\|^2

We also add the corresponding quadratic term to the policy-training objective in step 2.

Multiple trajectories

We can optimize multiple trajectories in parallel instead of a single trajectory. The benefit is similar to that of a mini-batch.

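A sketch of the resulting problem (the superscript (i) indexing the trajectories is my notation):

\min_{\tau^{(1)}, \ldots, \tau^{(N)}, \, \theta} \; \sum_{i=1}^{N} c\big(\tau^{(i)}\big) \quad \text{s.t.} \quad u_t^{(i)} = \pi_\theta\big(x_t^{(i)}\big) \;\; \text{for all } i, t

All trajectories share the single policy π_θ, which is what lets the policy generalize across the sampled scenarios.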

Note

This article was written a few years ago. I have not reviewed this lately. Please put messages in the comment section for any needed updates.
