RL — Guided Policy Search (A walkthrough)

Jonathan Hui
Apr 6, 2022

In the previous article, we discussed the concept of Guided Policy Search (GPS). Now we look into how it is trained.

Partially observable MDP (POMDP)

During training, we can set up a well-controlled environment to collect state information. However, when we deploy the solution in real life, we do not have that luxury.

For example, during training, we can hold the cube with the left robot arm. This gives us quality state information regarding the location of the target object and we can use it to pre-train the robot arm to maneuver to a target position.

However, for testing, we should place the cube anywhere to test the robustness of the solution. The states of the right robot arm (the pose of the gripper, the arm’s joint angles, etc.) are still known. But states of the external environment, like the target location, are missing and must be inferred from the robot’s camera. In fact, we do not tell the robot what is missing. We let it learn that by itself. For example, to learn how to use a hammer, the robot needs to learn that the grasp is important and start observing it through the camera.

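The policy for a fully observable MDP maps the full state xt to an action ut:

πθ(ut|xt)

But during testing, states for the external environment are missing, so the policy must act on the observations ot instead:

πθ(ut|ot)

The system derives the optimal actions from the states of the robot arm and the observations from the video.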

Feature points

To accomplish that, we extract feature points from the video for trajectory optimization.

Conceptually, the feature points are locations that we want to track. As shown below, those include the positions for the Lego block and the end-effector.


Combining the feature points with the state information of the robot arm, we can fit the model and plan the optimal trajectory with the controller.

To verify the result, we can plot these feature points during testing. As shown below, the photo shows two feature points: where the gripper holds the spatula and the edge of the bag. We change the dot color from red to yellow and then to green over time. As shown, these feature points successfully track the movement and the state of the spatula and the bag. So the feature points do provide the state information of the external environment that we need for trajectory optimization.

These feature points are detected using a CNN autoencoder. We train a CNN encoder to extract feature points and then use a decoder to reconstruct the image from them. We train this autoencoder to minimize the reconstruction error.

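Each feature point is the location of the strongest activation in one channel of the last convolutional layer. In practice, this is computed as a soft arg-max: a spatial softmax over the channel followed by the expected pixel position.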
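To make the idea concrete, here is a minimal NumPy sketch that turns a stack of activation maps into one (x, y) feature point per channel. It applies a softmax over each channel and then takes the expected pixel position, which approximates the highest-activation location while staying differentiable. The function name, the temperature parameter, and the [-1, 1] coordinate convention are assumptions made for illustration, not details taken from the original implementation.

```python
import numpy as np


def spatial_soft_argmax(activations, temperature=1.0):
    """Return one (x, y) feature point per channel.

    activations: array of shape (channels, height, width).
    Coordinates are normalized to [-1, 1] (a convention assumed here).
    """
    c, h, w = activations.shape
    flat = activations.reshape(c, h * w) / temperature
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)      # softmax per channel
    probs = probs.reshape(c, h, w)

    xs = np.linspace(-1.0, 1.0, w)
    ys = np.linspace(-1.0, 1.0, h)
    # Sum over rows to get a distribution over columns, then take its mean.
    expected_x = (probs.sum(axis=1) * xs).sum(axis=1)
    # Sum over columns to get a distribution over rows, then take its mean.
    expected_y = (probs.sum(axis=2) * ys).sum(axis=1)
    return np.stack([expected_x, expected_y], axis=1)
```

A channel with one sharp peak yields a point at that peak, while a completely flat channel yields the image center.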

The following photos demonstrate some of the feature points learned.


Training steps

Let’s walk through the training procedure in more detail:

  1. Set target end-effector pose.
  2. Train an exploratory controller.
  3. Learn the embedding of images.
  4. Provide the target goal.
  5. Train the final controller to achieve the final goal.

Before looking into the details, this is what the training looks like:

Step 1: Set target end-effector pose

We set the target pose of the robot’s end-effector, which is defined by three 3D points on the end-effector.

Step 2: Train an exploratory controller

Once the target pose is identified, an exploratory controller is trained:

The PR2 robotic arm has seven degrees of freedom. From the states of the robot arm (including the joint angles, velocities, and end-effector positions) and the target pose, the controller determines the torques for the seven motors. This step pre-trains the robot arm to maneuver to the target pose.

At first, the system initializes p as a random controller that takes random actions, but without large motions or unsafe moves. It then uses trajectory optimization to refine it. As shown in the video, the initial random controller explores the space near the initial position.

Run the controller on the robot and observe the trajectory.

We refine the exploratory controller by minimizing a loss function that brings the end-effector as close as possible to the target pose with the least effort.
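As an illustration, a loss of this kind could combine a squared distance to the target pose with a penalty on the motor torques. The sketch below assumes the pose is given as three 3D points and the action as seven torques, as described above. The function name and the weights w_dist and w_torque are hypothetical, and this is not necessarily the exact loss used in the paper.

```python
import numpy as np


def exploratory_cost(ee_points, target_points, torques,
                     w_dist=1.0, w_torque=1e-3):
    """Squared distance to the target pose plus a torque (effort) penalty.

    ee_points, target_points: arrays of shape (3, 3), three 3D points each.
    torques: array of shape (7,), one torque per motor.
    w_dist, w_torque: illustrative weights (assumed values).
    """
    dist_sq = np.sum((ee_points - target_points) ** 2)
    effort = np.sum(torques ** 2)
    return w_dist * dist_sq + w_torque * effort
```

The cost is zero only when the end-effector sits exactly on the target pose and no torque is applied. Raising w_torque trades accuracy for gentler motions.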

Our goal is to learn the dynamics and refine the controller.

There are many options to learn both. The following is one possible method:

The method fits the dynamics model to the trajectories observed after running the controller p on the robot. In this example, it observes the next five action sequences (the next five seconds) to fit the model.
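A minimal sketch of such a model fit, assuming the dynamics are approximated by a single linear model x' = A x + B u + c estimated with ridge-regularized least squares. This is a simplification for illustration: the function name and the ridge parameter are assumptions, and the actual method may fit a separate, time-varying model for each time step.

```python
import numpy as np


def fit_linear_dynamics(states, actions, next_states, ridge=1e-6):
    """Least-squares fit of next_state ~= A @ state + B @ action + c.

    states: (N, dx), actions: (N, du), next_states: (N, dx).
    Returns A of shape (dx, dx), B of shape (dx, du), c of shape (dx,).
    """
    n, dx = states.shape
    du = actions.shape[1]
    # One row per observed transition: [state, action, 1].
    features = np.hstack([states, actions, np.ones((n, 1))])
    gram = features.T @ features + ridge * np.eye(dx + du + 1)
    weights = np.linalg.solve(gram, features.T @ next_states)
    A = weights[:dx].T
    B = weights[dx:dx + du].T
    c = weights[-1]
    return A, B, c
```

Each row passed in is one observed transition (state, action, next state) collected while running the controller on the robot.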

We can use iLQR to optimize the trajectory. iLQR is an iterative method for finding the optimal controls when the model and the cost function are known. The algorithm above also constrains how much the controller can change in each update, to avoid bad decisions.
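iLQR repeatedly linearizes the dynamics and approximates the cost with a quadratic around the current trajectory, then solves the resulting LQR problem. The sketch below shows only that inner LQR backward pass, assuming time-invariant linear dynamics x' = A x + B u and a quadratic cost with matrices Q, R and Q_final. It omits the re-linearization loop and the constraint on controller change, and all names are chosen for illustration.

```python
import numpy as np


def lqr_backward_pass(A, B, Q, R, Q_final, horizon):
    """Return feedback gains K_0 .. K_{T-1} for the control law u_t = K_t @ x_t."""
    P = Q_final  # cost-to-go matrix at the final step
    gains = []
    for _ in range(horizon):
        BtP = B.T @ P
        K = -np.linalg.solve(R + BtP @ B, BtP @ A)
        P = Q + A.T @ P @ A + A.T @ P @ B @ K  # Riccati recursion
        gains.append(K)
    gains.reverse()  # gains were computed from the last step backward
    return gains


def rollout(A, B, gains, x0):
    """Simulate the closed-loop system and return the visited states."""
    x = x0
    states = [x]
    for K in gains:
        u = K @ x
        x = A @ x + B @ u
        states.append(x)
    return np.array(states)
```

For example, with a double-integrator model (position and velocity driven by one force input), the gains returned by lqr_backward_pass drive the state toward the origin when applied through rollout.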

The optimized exploratory controller does not know how to push a Lego block yet. Its purpose is to produce promising trajectories, along which the system records images to learn the feature points. As shown in the video, the refined exploratory controller is far more relevant to our task than random actions.

Step 3: Learn the embedding of images

The system runs the refined exploratory controller and collects images along its trajectories. We use a CNN autoencoder to learn the feature points.

There is no guarantee that these learned feature points are related to the task. But in practice, they work reasonably well.

We can also use a robot arm to hold the cube so we can derive the target pose of the cube (three 3-D points in the cube on the right-hand side below). With supervised learning, we can connect the feature points to a dense layer to predict the target pose. Therefore, we can train the CNN in conjunction with supervised learning to make sure the CNN extracts features related to the target pose of the cube.

Step 4: Provide the target goal

Next, we specify the target location where the Lego block should go. For example, we can click on the image from the robot’s camera (the green dot on the left below).

Step 5: Train the final controller to achieve the final goal

Now, the system concatenates the joint angles, end-effector positions, and feature points to form the state space of the POMDP. It can then be solved with any RL method.

Our cost function may look like

where d measures the distance of the feature points and the end-effector location from their target locations. Then, we can use iLQR to plan the optimal controls. The “o” marks below are the current locations of the feature points, and the “x” marks are their corresponding targets, calculated from the target provided in the previous step.
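The state construction and a cost based on the distance d can be sketched as follows. The particular cost shape, a squared-distance term plus a log term that rewards precise placement near the target, along with the weights and function names, are assumptions for illustration and are not taken from the article.

```python
import numpy as np


def build_state(joint_angles, ee_points, feature_points):
    """Concatenate the robot state and the visual feature points into one vector."""
    return np.concatenate([np.ravel(joint_angles),
                           np.ravel(ee_points),
                           np.ravel(feature_points)])


def placement_cost(ee_points, ee_targets, feature_points, feature_targets,
                   torques, w_sq=1.0, w_log=1.0, alpha=1e-5, w_torque=1e-3):
    """Cost on the distance d between tracked points and their targets.

    The weights and alpha are illustrative, assumed values.
    """
    current = np.concatenate([np.ravel(ee_points), np.ravel(feature_points)])
    target = np.concatenate([np.ravel(ee_targets), np.ravel(feature_targets)])
    d_sq = np.sum((current - target) ** 2)
    return (w_sq * d_sq
            + w_log * np.log(d_sq + alpha)
            + w_torque * np.sum(np.square(torques)))
```

The squared term pulls the points toward the targets from far away, while the log term keeps a strong gradient when the points are already close.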

Policy learning

The prime objective of GPS is to learn a policy. So while we are improving the controller, we also use the sampled trajectory to train a policy using supervised learning.



The training is mainly divided into two phases. In pre-training, the robot is trained under controlled conditions. It uses the extra state information to learn an initial controller, and it collects images from this controller to train a CNN that extracts feature points. At the end of this phase, the controller still cannot perform the task, but it gets much closer.

Then it applies GPS to train the controller with the states of the robot arm and the feature points observed from the robot’s camera. At the same time, it trains a policy in parallel on the sampled trajectories and keeps the policy in sync with the controller, which makes training more stable.
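The supervised part can be sketched as regression from observations to the actions chosen by the controller. The sketch below uses a linear policy fitted by least squares as a stand-in for the CNN policy. This is a deliberate simplification: the function names are hypothetical, and the step that keeps the controller and policy in sync is not shown.

```python
import numpy as np


def fit_policy(observations, controller_actions):
    """Fit u ~= W @ o + b by least squares on (observation, action) pairs.

    observations: (N, do), controller_actions: (N, du).
    Returns W of shape (du, do) and b of shape (du,).
    """
    n = observations.shape[0]
    features = np.hstack([observations, np.ones((n, 1))])
    weights, *_ = np.linalg.lstsq(features, controller_actions, rcond=None)
    W = weights[:-1].T
    b = weights[-1]
    return W, b


def policy(W, b, observation):
    """Action predicted by the fitted linear policy."""
    return W @ observation + b
```

Each training pair comes from a sampled trajectory: the observation at a time step and the action the controller took there.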


Architecture design

The whole network consists of seven layers with 92K parameters. The first three are convolutional layers that extract features. A spatial softmax is then computed for each channel to obtain the feature points. Combining the feature points with the state of the robot, three dense layers compute the torques for the seven motors.


Revisiting the problems

Let’s revisit some of the problems mentioned in Part 1 of the article.

Reduce physical simulation

Model-based RL is more sample efficient than other RL methods. Training takes minutes rather than weeks. The model takes fewer samples to build, and we can use trajectory optimization to complete the planning without further sampling.

Robotic control with the camera

In training, it uses trajectory optimization to generate samples to train another policy that uses camera observations as input.

This allows us to derive actions from video observations.

Don’t execute a bad policy that may harm the robot

We use sound methods to refine an exploratory controller in pre-training. When applying GPS, we limit how much the controller can change in each update, so it does not become too aggressive and make bad decisions.
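One common way to measure how much a controller has changed is the KL divergence between the old and new action distributions. The sketch below assumes both are Gaussian and uses a simple accept-or-reject check against a threshold max_kl. The threshold value and the function names are hypothetical, and the actual algorithm enforces the limit inside the trajectory optimization rather than as a separate check.

```python
import numpy as np


def gaussian_kl(mean_new, cov_new, mean_old, cov_old):
    """KL(new || old) between two multivariate Gaussian distributions."""
    k = mean_new.shape[0]
    diff = mean_old - mean_new
    cov_old_inv = np.linalg.inv(cov_old)
    _, logdet_new = np.linalg.slogdet(cov_new)
    _, logdet_old = np.linalg.slogdet(cov_old)
    return 0.5 * (np.trace(cov_old_inv @ cov_new)
                  + diff @ cov_old_inv @ diff
                  - k
                  + logdet_old - logdet_new)


def accept_update(mean_new, cov_new, mean_old, cov_old, max_kl=0.1):
    """Accept the new controller only if it stays close to the old one."""
    return gaussian_kl(mean_new, cov_new, mean_old, cov_old) <= max_kl
```

A small max_kl keeps each update conservative, so the robot never jumps to behavior far from what it has already executed safely.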

Potential solutions may be ill-conditioned

We use a collocation method for optimizing the trajectory, i.e. we optimize over both actions and states (instead of just actions) when searching for the optimum.

So the behavior is closer to the graph below:


Policy drift

As a direct quote from the Guided Policy Search paper:

The RL stage adapts to the current policy πθ(ut|ot), providing supervision at states that are iteratively brought closer to the states visited by the policy.

Credits & references

End-to-End Training of Deep Visuomotor Policies

UC Berkeley Reinforcement Course