RL — Prediction

Jonathan Hui
9 min read · Apr 26, 2022
Photo by Shane Hauser

How can we learn better? This is something we also struggle with in real life. Besides meta-learning, humans make predictions. Even when we don’t know the rules, we make predictions based on instinct.


Prediction plays a major role in decision-making. Sometimes we predict the next state after taking an action. Other times, we ask what actions are needed to move from one state to another.

Predict state from action

In the former case, we base our prediction on the state o and the action u, and use a decoder to reconstruct the predicted state.

For example, the left side below is the state and the action, and the right-hand side is the target state.

There are many methods to produce such predictions. In model-based learning, a model (the system dynamics) is used to make the prediction. In the following example, we use CDNA, composed of convolutional layers, to create kernels and masks. Each kernel transforms the image, and each mask keys out the objects we are not interested in. For example, one kernel transforms the circle object and another the square. To make the final prediction, we use the corresponding mask to key out irrelevant objects.

In CDNA, there are ten kernels to transform images and eleven masks, with one extra mask for the background. Our prediction is made by merging all the transformed images with their corresponding masks.
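The mask-and-merge step can be sketched as follows. This is a toy NumPy sketch, not the paper’s implementation; the function name, shapes, and the pixel-wise softmax over mask channels are illustrative assumptions.

```python
import numpy as np

def merge_predictions(transformed, masks, background):
    """Combine kernel-transformed images into one predicted frame.

    transformed: (K, H, W, C) images, one per transformation kernel.
    masks:       (K + 1, H, W) weights summing to 1 at each pixel
                 (one channel per kernel plus one for the background).
    background:  (H, W, C) static background image.
    """
    K = transformed.shape[0]
    pred = masks[K, ..., None] * background           # background channel
    for k in range(K):
        pred += masks[k, ..., None] * transformed[k]  # keyed-in object k
    return pred

# Toy example: 2 kernels, 4x4 single-channel images.
rng = np.random.default_rng(0)
imgs = rng.random((2, 4, 4, 1))
logits = rng.random((3, 4, 4))
masks = np.exp(logits) / np.exp(logits).sum(axis=0)   # pixel-wise softmax
bg = rng.random((4, 4, 1))
out = merge_predictions(imgs, masks, bg)
```

Because the masks form a convex combination at each pixel, the merged prediction stays within the value range of its inputs.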

This training is completely unsupervised. Once it is done, we can click on the image to indicate where an object should move. In the photo below, the red dots mark the initial positions and the green dots the target locations.

Once the prediction model is trained, we use it for planning. We:

  • Sample many potential action sequences,
  • Predict the future flow for each action sequence,
  • Pick the best result and execute the action, and
  • Repeat the steps until reaching the target location.
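The loop above can be sketched as a simple random-shooting planner. This is a minimal sketch assuming a learned one-step model and a task cost; the function names and the 1-D continuous action space are illustrative, not from the paper.

```python
import numpy as np

def plan_action(state, predict, cost, horizon=5, n_samples=100, rng=None):
    """Sample action sequences, roll out the learned model, and return
    the first action of the lowest-cost sequence."""
    rng = rng if rng is not None else np.random.default_rng()
    best_cost, best_first_action = np.inf, None
    for _ in range(n_samples):
        seq = rng.uniform(-1.0, 1.0, size=horizon)    # candidate actions
        s, total = state, 0.0
        for a in seq:
            s = predict(s, a)           # learned model: next state
            total += cost(s)            # task cost, e.g. distance to goal
        if total < best_cost:
            best_cost, best_first_action = total, seq[0]
    return best_first_action

# Toy 1-D example: the state moves by the action; the goal is state 3.0.
predict = lambda s, a: s + a
cost = lambda s: abs(s - 3.0)
a0 = plan_action(0.0, predict, cost, rng=np.random.default_rng(0))
```

In practice only the first action is executed, and the planner is rerun from the newly observed state (receding-horizon control).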

Predict actions from states

We can also predict actions that lead us from one state to another:

Below, we have two CNNs that share all their parameters. We input the images from before and after poking the bottle, and use the network below to predict the action taken.


So given an initial state and the target state, we can derive a sequence of actions that take us there.


Feature representation

We can act well only if we extract the right information. In transfer learning, we apply general features to solve a specific task. These general features can be learned from a large-scale classification task such as ImageNet, or from reconstructing images with an autoencoder.


However, there is no guarantee that the extracted features are helpful for the specific task. Can we ensure the extracted features are relevant to the action?

In the figure below, we encode features from the observation, combine them with the action to predict the next state, and then reconstruct the prediction using the decoder.

We train the model to minimize the loss between our prediction and the next observation. Through this process, the model learns to extract features that are closely tied to the underlying task.
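The objective can be sketched with toy linear stand-ins for the encoder, dynamics, and decoder. All weights, dimensions, and function names here are illustrative assumptions, not from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear stand-ins for the encoder, dynamics model, and decoder.
W_enc = rng.normal(size=(4, 8))   # observation (8-d) -> features (4-d)
W_dyn = rng.normal(size=(4, 5))   # [features; action] (4+1) -> next features
W_dec = rng.normal(size=(8, 4))   # features -> reconstructed observation

def predict_next_obs(obs, action):
    z = W_enc @ obs                                   # encode features
    z_next = W_dyn @ np.concatenate([z, [action]])    # action-conditioned step
    return W_dec @ z_next                             # decode the prediction

def prediction_loss(obs, action, next_obs):
    """L2 loss between the decoded prediction and the observed next frame.
    Minimizing this pushes the encoder toward task-relevant features."""
    err = predict_next_obs(obs, action) - next_obs
    return float(err @ err)

obs, next_obs = rng.normal(size=8), rng.normal(size=8)
loss = prediction_loss(obs, 0.5, next_obs)
```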

For the rest of the article, we will give an overview of some research papers.

(Note: this article was written three years ago, but I decided to publish it now for record-keeping and hopefully to give some pointers on these topics.)

Embed to Control (E2C)

One of the challenges is to learn features that are relevant to performing tasks.

Given an observation, we can compute the optimal action and use the model (the system dynamics) to find the next state. This is our prediction, and we can use the decoder to reconstruct the image.

We can run the action on the robot and compare our prediction with the simulated result. We use the computed error to train the system end-to-end.

Let’s do it step by step:

  1. Predict the latent z for the next state using the encoder and the model.

Modified from source

  2. Run the action, then use the observed result and the encoder to compute the latent representation.

  3. Train the model to minimize the error between the prediction and the observed result.

  4. For completeness, add back the reconstruction flow.
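The steps above can be sketched as a single loss with toy linear stand-ins. This is a rough sketch of the idea, not E2C itself (the paper works with a variational latent model); the weights, dimensions, and the `e2c_style_loss` name are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
W_enc = rng.normal(size=(3, 6))   # encoder: observation (6-d) -> latent (3-d)
W_dec = rng.normal(size=(6, 3))   # decoder: latent -> observation
W_dyn = rng.normal(size=(3, 4))   # latent dynamics: [z; action] -> next latent

def e2c_style_loss(obs, action, next_obs):
    """Latent prediction error plus reconstruction error."""
    z = W_enc @ obs
    z_pred = W_dyn @ np.concatenate([z, [action]])   # step 1: predicted latent
    z_next = W_enc @ next_obs                        # step 2: observed latent
    pred_err = z_pred - z_next                       # step 3: prediction loss
    rec_err = W_dec @ z - obs                        # step 4: reconstruction
    return float(pred_err @ pred_err + rec_err @ rec_err)

loss = e2c_style_loss(rng.normal(size=6), 0.1, rng.normal(size=6))
```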


Below is the reconstruction of the prediction for a moving robot arm.

Modified from source

Action-Conditional Video Prediction

We can have a more direct approach to making video predictions.

The first approach concatenates k frames together and processes them with a CNN encoder, so the input to the CNN has k × 3 channels.
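A minimal sketch of this frame stacking (the frame size and k are illustrative):

```python
import numpy as np

def stack_frames(frames):
    """Concatenate k RGB frames of shape (H, W, 3) along the channel axis,
    giving one (H, W, k*3) input for the CNN encoder."""
    return np.concatenate(frames, axis=-1)

frames = [np.zeros((64, 64, 3)) for _ in range(4)]   # k = 4 recent frames
x = stack_frames(frames)
```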


The second approach feeds one image at a time and uses an LSTM to store the history in its internal state.


The following is a more detailed architectural view of the first approach:


where we transform the encoder output with the action so that it is ready for decoding.
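One common way to do this, used in the Action-Conditional Video Prediction paper, is a multiplicative interaction between the action and the encoded features. Below is a toy sketch with linear weights; the dimensions and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
W_h = rng.normal(size=(16, 32))   # projects encoder features (32-d)
W_a = rng.normal(size=(16, 4))    # projects a one-hot action (4 actions)
W_d = rng.normal(size=(32, 16))   # maps the interaction back for the decoder

def transform(features, action_onehot):
    """Multiplicative interaction: the action gates the projected features
    element-wise, and the result is mapped back for the decoder."""
    return W_d @ ((W_h @ features) * (W_a @ action_onehot))

h = rng.normal(size=32)             # encoder output
h_dec = transform(h, np.eye(4)[2])  # condition on action 2
```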

This is the corresponding design for the second approach.


Both methods have reasonable success on scenes that are relatively simple to predict, and less success on non-synthetic images. It is also unclear how to plan optimal actions with them, since no cost function or rewards are defined for the task the predictions relate to.

Informed exploration

But this is very useful for exploration. For the space we want to explore, we may want to visit states that look very different from previous frames. We can use a Gaussian kernel to calculate the difference.
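A minimal sketch of such a Gaussian-kernel comparison (the function name and bandwidth are illustrative assumptions):

```python
import numpy as np

def gaussian_similarity(frame_a, frame_b, sigma=1.0):
    """Gaussian-kernel similarity between two frames. A small value means
    the predicted frame looks very different, so it is worth exploring."""
    d2 = np.sum((frame_a - frame_b) ** 2)
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

same = gaussian_similarity(np.zeros((8, 8)), np.zeros((8, 8)))   # identical
diff = gaussian_similarity(np.zeros((8, 8)), np.ones((8, 8)))    # far apart
```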

Action-conditioned multi-frame video prediction via flow prediction

Autonomous learning is very important in RL. The whole process may take a million video frames to train the system, and collecting them in real life is time-consuming. But the solution can still scale well as long as it does not involve expensive labeling efforts or expert demonstrations.

Previously, we predicted what the whole image would look like after taking some actions. What we may really want is to predict motion instead. To do so, the model predicts transformation matrices that transform the image (translate, rotate, etc.). But a transformation should not apply to the whole image, so the model also predicts masks to mask out irrelevant areas. For example, the top section below applies a translation to move objects to the right, but the mask ensures only the circle is affected. The bottom section moves objects 45° upwards with a mask that affects only the triangle.

Modified from source

Here, a convolutional network analyzes the image, built on top of an LSTM to retain historical information. The design generates 10 transformation matrices to transform the images and predicts 11 masks, with one extra for the background. Each transformed image is masked to remove irrelevant information, and the results are merged together to form the prediction.


The yellow arrow above is the skip connection.

For planning, it

  • Sample many potential action sequences.
  • Predict the future flow for each action sequence.
  • Pick the best result and execute the corresponding action.
  • Repeat the previous steps.

The training is completely unsupervised. Once it is done, we can click on the image to indicate where an object should move.

This method handles real images. However, it is still at an early stage, dealing with simple tasks against simple backgrounds, and it is computationally intense.

Inverse models

If we can predict the next state from the current state and action, we should also be able to predict the action from the initial and target states.

A Siamese CNN is a pair of identical networks sharing parameters. It processes the two input images and predicts the action that may lead to the target state.
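A toy sketch of this Siamese inverse model, with linear layers standing in for the CNNs (all weights, dimensions, and the discrete action set are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
W_cnn = rng.normal(size=(8, 16))   # shared encoder weights (toy stand-in)
W_act = rng.normal(size=(3, 16))   # action head over the paired features

def predict_action(obs_before, obs_after):
    """Siamese inverse model: embed both images with the SAME weights,
    then predict the action from the concatenated embeddings."""
    f_before = W_cnn @ obs_before      # shared encoder, branch 1
    f_after = W_cnn @ obs_after       # shared encoder, branch 2
    logits = W_act @ np.concatenate([f_before, f_after])
    return int(np.argmax(logits))      # most likely discrete action

action = predict_action(rng.normal(size=16), rng.normal(size=16))
```

Because the two branches share weights, both images are embedded in the same feature space, which is what makes the before/after comparison meaningful.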


The system continues to predict and execute actions.


This is the poke it predicts for moving the bottles.


This method is self-supervised. It doesn’t reconstruct images, and it cares about the action but not the model itself, so it is not compatible with planning.

Predict alternative quantities

In this approach, we want to predict quantities that can be observed easily and are important for the task. For example, are we going to hit any obstacle?

What are the health and ammo levels in the game below?


We are going to observe the difference between the current and the future values.

We want to maximize a goal u, parameterized by a goal vector g. The goal is defined as a linear combination of f.

On the other hand, we approximate f with another model F, parameterized by θ, which takes the observation and the action as input.

Let’s find the action that can maximize the goal.

And we train F to match the observed f.
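The action selection can be sketched as an argmax of the goal over a discrete action set. The linear F, the dimensions, and the measurement weights below are illustrative assumptions, not the paper’s network.

```python
import numpy as np

rng = np.random.default_rng(4)
theta = rng.normal(size=(2, 9))   # toy linear predictor F(o, a; theta)

def F(obs, action, n_actions=4):
    """Predict future measurement changes f for (observation, action)."""
    a_onehot = np.eye(n_actions)[action]
    return theta @ np.concatenate([obs, a_onehot])   # 2 measurements

def best_action(obs, g, n_actions=4):
    """Pick the action maximizing the goal u = g . F(obs, a)."""
    scores = [g @ F(obs, a, n_actions) for a in range(n_actions)]
    return int(np.argmax(scores))

g = np.array([1.0, -0.5])         # e.g. weight health up, ammo use down
obs = rng.normal(size=5)
a_star = best_action(obs, g)
```

Changing g at test time changes the behavior without retraining, since the same F serves any linear goal over the measurements.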

This method predicts quantities that we care about, but it requires handcrafted measurements that are observable.

Credit & reference

Embed to Control (E2C) paper

Action-Conditional Video Prediction paper

Unsupervised Learning for Physical Interaction through Video Prediction paper

Learning to Poke by Poking paper

Learning to Act by Predicting the Future paper