RL — Model-Based Learning with Raw Videos

Jonathan Hui
9 min read · Apr 11, 2022

Vision is a critical part of intelligence and the decision-making process. Many toy experiments avoid raw image processing and handcraft features to simplify the task. But for real-life tasks, such handcrafting is labor-intensive and not necessarily transferable to other tasks.

Performing reinforcement learning directly from raw videos is therefore critical.

Training vs. Testing

During training, we can establish a well-controlled environment to identify the target state. However, when deploying the solution in real life, we don’t have that luxury.

For example, we can hold the cube above with the left robot arm. The robot arm knows its pose and location, and this information helps us to locate the position of the cube. This information is helpful in pre-training the robot arm to maneuver to a target position. But no “robot” hand-holding should be required during testing.

In testing, the states of the right robot arm (the pose of the gripper, the arm’s angles, etc.) are still known. But the states of the external environment are missing and must be observed through the robot’s camera. In fact, we do not analyze what is missing. We let the robot learn it by itself. For example, to use a hammer, the robot needs to learn that the grasp is important and observe it from raw images.

Partially observable MDP (POMDP)

Raw images live in a high-dimensional space where the information is entangled. The robot needs to learn how to encode an observation into the features needed for the task, and then use this encoding to determine its actions.

There are many possible solutions and we will discuss some of them.

Model in Latent Space

Encoding images is common in deep learning. The key concept is representing images in a low-dimensional latent space from which the images can be reconstructed with the least error.

One way to compute the reconstruction error is to calculate the mean squared error over the reconstructed pixels.
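
As a minimal sketch of this error, assuming each image is flattened into a plain list of pixel intensities (the function name and the sample values are made up for illustration):

```python
def mean_squared_error(original, reconstructed):
    """Average squared pixel difference between two equally sized images.

    Both images are flat lists of pixel intensities, an assumption made
    here to keep the sketch dependency-free.
    """
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    squared_diffs = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return sum(squared_diffs) / len(squared_diffs)


image = [0.0, 0.5, 1.0, 0.25]
decoded = [0.0, 0.5, 0.5, 0.25]  # one pixel is off by 0.5
error = mean_squared_error(image, decoded)
```

A perfect reconstruction gives an error of zero, and the error grows with the square of each pixel's deviation.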

Autonomous reinforcement learning on the raw image

In RL, we can encode and decode a raw image using a CNN autoencoder. We minimize the reconstruction error to make sure the CNN is extracting critical features of the images.


Optionally, we add an L1 penalty to encourage a sparse representation.
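
One plausible way to write such a loss is sketched below. The weight `l1_weight` and the toy latent codes are illustrative assumptions, not values from the original work.

```python
def sparse_autoencoder_loss(original, reconstructed, latent, l1_weight=0.01):
    """Reconstruction MSE plus an L1 penalty on the latent code."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    l1_penalty = sum(abs(z) for z in latent)
    return mse + l1_weight * l1_penalty


image = [0.0, 1.0]
decoded = [0.0, 1.0]  # same reconstruction quality in both cases
dense_code = [0.5, -0.5, 0.5, -0.5]
sparse_code = [1.0, 0.0, 0.0, 0.0]

loss_dense = sparse_autoencoder_loss(image, decoded, dense_code)
loss_sparse = sparse_autoencoder_loss(image, decoded, sparse_code)
```

With identical reconstructions, the code with fewer non-zero entries gets the lower loss, which is what pushes the encoder toward sparsity.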

There are many techniques in deep learning that we can use to encode images. For example, in a Variational Autoencoder, instead of learning the latent vector z directly, we learn the parameters of a Gaussian distribution over z. We sample z from it and train the autoencoder to minimize the reconstruction error.
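
The sampling step can be sketched as follows. This assumes the encoder outputs a mean and a log-variance per latent dimension, which is a common VAE convention rather than something stated above. The function name is hypothetical.

```python
import math
import random


def sample_latent(mean, log_var, rng=random):
    """Draw z ~ N(mean, exp(log_var)) one dimension at a time.

    Uses the reparameterization z = mean + std * eps with eps ~ N(0, 1).
    """
    z = []
    for mu, lv in zip(mean, log_var):
        std = math.exp(0.5 * lv)
        eps = rng.gauss(0.0, 1.0)
        z.append(mu + std * eps)
    return z


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# The second dimension has a tiny variance, so its sample stays near 2.0.
z = sample_latent(mean=[0.0, 2.0], log_var=[0.0, -50.0], rng=rng)
```

Writing the sample as a deterministic function of the mean, the standard deviation, and an independent noise term is what lets gradients flow back to the encoder during training.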

Once a lower-dimensional representation is extracted from the image, we can apply RL to those features. For example, we can play out multiple car races and fit the Q(g(o), a) function, where g(o) is the encoding of the observation o.
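
As one hedged illustration of fitting Q on encoded features, the sketch below runs a single Q-learning update on a linear Q-function. The linear form, the learning rate, and the discount are all assumptions made for this example, and `features` simply stands in for g(o).

```python
def q_value(weights, features, action):
    """Linear Q estimate: dot product of the action's weights and the features."""
    return sum(w * f for w, f in zip(weights[action], features))


def q_learning_update(weights, features, action, reward, next_features,
                      learning_rate=0.1, discount=0.9):
    """One temporal-difference step on a linear Q(g(o), a)."""
    best_next = max(q_value(weights, next_features, a) for a in range(len(weights)))
    td_error = reward + discount * best_next - q_value(weights, features, action)
    for i, f in enumerate(features):
        weights[action][i] += learning_rate * td_error * f
    return td_error


# Two actions, three latent features.
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
features, next_features = [1.0, 0.0, 0.5], [0.0, 1.0, 0.5]
td_error = q_learning_update(weights, features, action=1, reward=1.0,
                             next_features=next_features)
```

Only the weights of the action that was taken change, and they move in proportion to the features that were active.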


Collecting raw images is easy, and encoding visual information is heavily studied in deep learning. Nevertheless, the encoding is trained separately from the task. What we learn is not necessarily good for performing the task or for developing the system dynamics model. Not surprisingly, though, it does work to some extent.

Deep Spatial Autoencoders

Let’s detail a method called the deep spatial autoencoder which utilizes raw images to complete RL tasks. The training process includes:

  1. Set target end-effector pose.
  2. Train an exploratory controller.
  3. Learn the embedding of images.
  4. Provide the target goal.
  5. Train the final controller to achieve the final goal.

Before looking into the details, this is what the training looks like:

Step 1: Set target end-effector pose

We set the target pose of the robot’s end-effector. It is defined by three 3D points of the end-effector which indicate its target position after pushing a Lego block.

In pre-training, the robot arm learns the dynamics in reaching a target position. Our pre-training objective is not learning how to push the Lego block. It just refines an exploration policy to produce promising trajectories. After the refinement, we record the trajectory images to train the image encoding.

Step 2: Train an exploratory controller

Once the target pose is identified, we train an exploratory controller:

The PR2 robotic arm has seven degrees of freedom. From the states of the robot arm (including the joint angles, velocity, and end-effector positions) and the target pose, the controller determines the torques of seven motors.

First, we initialize p to be a random controller that takes random actions but avoids large motions and unsafe moves. Alternatively, we can pre-program it with some generic or educated moves. As shown in the video, the initial random controller explores the space near the initial position.

We run the controller on the robot and observe the trajectory.

We refine the exploratory controller by minimizing a loss function. We want to get as close as possible to the target pose with the least effort.
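
A loss with this trade-off could look like the sketch below. The quadratic form and the `effort_weight` value are assumptions made for illustration, not the exact loss used in the original work.

```python
def reaching_cost(end_effector_points, target_points, torques, effort_weight=0.01):
    """Squared distance to the target pose plus a penalty on motor torques."""
    distance_sq = 0.0
    for point, target in zip(end_effector_points, target_points):
        distance_sq += sum((p - t) ** 2 for p, t in zip(point, target))
    effort = sum(u ** 2 for u in torques)
    return distance_sq + effort_weight * effort


target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
current = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1), (0.0, 1.0, 0.1)]  # 0.1 too high
torques = [1.0] * 7  # one torque per motor

cost = reaching_cost(current, target, torques)
```

Here the three end-effector points contribute 0.03 and the seven torques contribute 0.07, so a larger `effort_weight` makes the controller prefer gentler motions over precision.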

Our final goal is to learn the dynamics and to refine the controller.

There are many options to learn both. The following is one possible method:

It fits the model with the observed trajectories after running the controller p on the robot. In this example, we observed the next five action sequences (the next five seconds) to fit the model. This model assumes a Gaussian distribution with the means calculated using linear dynamics.

Then we optimize the controller, which is also modeled as a Gaussian distribution.

We call these the linear Gaussian controller and the linear Gaussian model.
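
To make the model-fitting step concrete, here is a sketch that fits linear dynamics by least squares for a scalar state and a scalar action. A real arm has many dimensions and the fit is usually done per time step, so this is a deliberately simplified stand-in. The function name and the sample data are hypothetical.

```python
def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of x' ~ a * x + b * u for scalar x and u.

    Returns (a, b, noise_variance). The variance of the residuals plays the
    role of the Gaussian noise in a linear-Gaussian model.
    """
    sxx = sum(x * x for x in states)
    sxu = sum(x * u for x, u in zip(states, actions))
    suu = sum(u * u for u in actions)
    sxy = sum(x * y for x, y in zip(states, next_states))
    suy = sum(u * y for u, y in zip(actions, next_states))
    det = sxx * suu - sxu * sxu
    if abs(det) < 1e-12:
        raise ValueError("states and actions are not informative enough to fit")
    # Solve the 2x2 normal equations with Cramer's rule.
    a = (sxy * suu - suy * sxu) / det
    b = (suy * sxx - sxy * sxu) / det
    residuals = [y - (a * x + b * u) for x, u, y in zip(states, actions, next_states)]
    noise_variance = sum(r * r for r in residuals) / len(residuals)
    return a, b, noise_variance


# Noise-free samples generated from x' = 0.9 * x + 0.5 * u.
states = [1.0, 2.0, -1.0, 0.5]
actions = [0.0, 1.0, 1.0, -2.0]
next_states = [0.9 * x + 0.5 * u for x, u in zip(states, actions)]
a, b, noise_variance = fit_linear_dynamics(states, actions, next_states)
```

The fitted mean is linear in the state and action, and the residual variance supplies the Gaussian part of the model.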

We can use iLQR to optimize the trajectory. iLQR is an iterative process that finds the optimal controls when the model and the cost function are known. The algorithm above also applies a constraint that limits how much the controller changes in each update, to avoid bad decisions.
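
Each iLQR iteration approximates the dynamics as linear and the cost as quadratic around the current trajectory, then solves the resulting LQR problem. The sketch below shows only that inner LQR backward pass for a scalar system. It omits the re-linearization loop and the constraint on controller change, and all the numbers are illustrative.

```python
def lqr_gains(a, b, q, r, horizon):
    """Backward pass of finite-horizon LQR for x' = a*x + b*u.

    Minimizes the sum of q*x^2 + r*u^2 and returns feedback gains K_t,
    so that the control is u_t = K_t * x_t.
    """
    p = q  # value-function curvature at the final step
    gains = []
    for _ in range(horizon):
        k = -(a * b * p) / (r + b * b * p)
        p = q + r * k * k + p * (a + b * k) ** 2
        gains.append(k)
    gains.reverse()  # computed last-to-first, returned first-to-last
    return gains


def rollout(a, b, gains, x0):
    """Apply the gains in order and return the visited states."""
    states = [x0]
    for k in gains:
        u = k * states[-1]
        states.append(a * states[-1] + b * u)
    return states


gains = lqr_gains(a=1.0, b=0.5, q=1.0, r=0.1, horizon=20)
states = rollout(1.0, 0.5, gains, x0=2.0)
```

The gains are all negative, so the controller pushes the state back toward zero, and raising `r` makes it do so more cautiously.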

The optimized exploratory controller does not know how to push a Lego block yet. Its purpose is to collect data and raw images while the robot arm reaches for the target pose. Using these images, we learn the image embedding. As shown in the video, the refined exploratory controller is far more relevant to our task than random actions.

Step 3: Learn the embedding of images

Then, we collect images from the refined trajectories and train a CNN autoencoder to extract feature points, i.e., the locations of important features. As discussed before, we use the reconstruction error to train the CNN.

But we can cheat a little bit. We can use a robot arm to hold the cube during training so we can derive the target pose of the cube (three 3D points in the cube on the right-hand side below). With supervised learning, we can connect the feature points to a dense layer to predict this target pose. Therefore, we can train our CNN in conjunction with supervised learning to make sure some feature points are related to the target pose of the cube.

Feature points

The feature points are not the same as features in deep learning. Feature points are the locations of important features.

Intuitively, a feature point is roughly the location of the highest activation in each channel. These feature points should identify locations that are important to the task, for example, the point where PR2 holds the spatula.

To extract them, we take a softmax over the pixels in each channel of the last convolutional layer:

where α is a learned parameter. The feature point for each channel is defined as:
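
The two steps can be sketched for a single channel as follows, assuming the formulation in the Deep Spatial Autoencoders paper: the softmax divides each activation by α before exponentiating, and the feature point is the expected pixel position under the resulting distribution. The function names and the toy 3×3 channel are illustrative only.

```python
import math


def spatial_softmax(activations, alpha=1.0):
    """Softmax over all pixels of one channel, with temperature alpha."""
    peak = max(max(row) for row in activations)  # subtracted for numerical stability
    exps = [[math.exp((a - peak) / alpha) for a in row] for row in activations]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def feature_point(activations, alpha=1.0):
    """Expected (row, column) position under the spatial softmax."""
    probs = spatial_softmax(activations, alpha)
    row = sum(i * p for i, prob_row in enumerate(probs) for p in prob_row)
    col = sum(j * p for prob_row in probs for j, p in enumerate(prob_row))
    return row, col


channel = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 9.0],  # strong activation at row 1, column 2
    [0.0, 0.0, 0.0],
]
point = feature_point(channel, alpha=0.5)
```

With one dominant activation the feature point lands almost exactly on it, while a flat channel yields the image center. Because the expectation is differentiable, the whole extraction can be trained end to end.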

Unfortunately, if those points are learned without the context of a task, there is no guarantee that we have the right feature points. Often, we cover unimportant objects with a blanket so the learning doesn’t get distracted.

So how good are the feature points we extracted? The photos below show two feature points learned by the CNN. They locate where the robot is holding the spatula and the edge of the bag. The pictures track the locations of these feature points over time, with red marking their starting positions and the color gradually changing to yellow and then to green. As indicated, these feature points successfully track the movement and the state of the spatula and the bag.


So this strategy does provide critical state information for objects we want to track. Now, we are ready to combine the state of the robot arm with the feature points to fit the model and to learn the controller:

Step 4: Provide the target goal

Next, we identify the target location where the object should go. For example, we click on the green spots below to identify the target position.

Step 5: Train the final controller to achieve the final goal

Now, we can concatenate the joint angles, end-effector positions, and feature points to form the state space of the POMDP. Then we just solve it with any RL method.
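
The concatenation itself is simple. In the sketch below, the number of feature points and all the values are placeholders chosen for illustration.

```python
def build_state(joint_angles, end_effector_points, feature_points):
    """Flatten robot readings and image feature points into one state vector."""
    state = list(joint_angles)
    for point in end_effector_points:
        state.extend(point)
    for point in feature_points:
        state.extend(point)
    return state


state = build_state(
    joint_angles=[0.0] * 7,                     # seven joints on the arm
    end_effector_points=[(0.0, 0.0, 0.0)] * 3,  # three 3D points
    feature_points=[(12.0, 30.0), (5.0, 8.0)],  # two 2D image locations
)
```

This gives 7 + 9 + 4 = 20 numbers here. The robot-side entries come from sensors, while the feature points come from the trained encoder.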

Our cost function may look like

where d measures the distance of the feature points and the end-effector location from their target locations. Then, we use iLQR to plan the optimal controls. The “o” marks below are the current locations of the feature points and the “x” marks are their corresponding targets. These targets can be calculated from the target location we provided in the previous step.
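
As a rough sketch of the distance term only, the function below sums the Euclidean distance between each tracked point and its target. The actual cost may weight or shape these distances differently, and the sample points are invented.

```python
import math


def distance_cost(current_points, target_points):
    """Sum of Euclidean distances between matched points and their targets.

    Each pair may be 2D (image feature points) or 3D (end-effector points),
    as long as a point and its target have the same dimension.
    """
    return sum(math.dist(p, t) for p, t in zip(current_points, target_points))


current = [(3.0, 4.0), (0.0, 0.0, 1.0)]  # one feature point, one 3D point
targets = [(0.0, 0.0), (0.0, 0.0, 0.0)]
cost = distance_cost(current, targets)
```

In this example the feature point is 5 units from its target and the 3D point is 1 unit away, and the cost drops to zero only when every point sits on its target.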

Guided Policy Search

While we use iLQR to plan the optimal trajectory, we can also use a dense network to learn a policy with trajectory samples created by the controller.

The basic concept is creating a policy that imitates what the controller does. If the policy model is simpler, it may generalize better to work with different scenarios. This is a pretty large topic and we will reserve another article to discuss this.
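
To illustrate the imitation idea in its simplest form, the sketch below fits a linear policy to state and control samples by least squares. This swaps the dense network for a one-parameter-pair regression purely to keep the example short, and the controller that generates the samples is hypothetical.

```python
def fit_linear_policy(states, controls):
    """Least-squares fit of u ~ gain * x + bias from controller samples."""
    n = len(states)
    mean_x = sum(states) / n
    mean_u = sum(controls) / n
    var_x = sum((x - mean_x) ** 2 for x in states)
    if var_x == 0.0:
        raise ValueError("need at least two distinct states")
    cov_xu = sum((x - mean_x) * (u - mean_u) for x, u in zip(states, controls))
    gain = cov_xu / var_x
    bias = mean_u - gain * mean_x
    return gain, bias


# Samples from a hypothetical controller u = -1.5 * x + 0.2.
states = [-1.0, 0.0, 0.5, 2.0]
controls = [-1.5 * x + 0.2 for x in states]
gain, bias = fit_linear_policy(states, controls)
```

The fitted policy reproduces the controller on the sampled states. A neural network policy is trained in the same supervised way, only with a richer function class.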

Credits & references

Deep Spatial Autoencoders for Visuomotor Learning paper

UC Berkeley Reinforcement Learning course

End-to-End Training of Deep Visuomotor Policies