RL — Transfer Learning (Learn from the Past)

Jonathan Hui
Jun 27, 2022
Photo by Bryce Evans

Humans are explorers, and we explore smartly. In reinforcement learning (RL), model-free methods may search the solution space millions of times, which is hardly efficient. Humans also learn from experience, which is an integral part of intelligence: we adapt existing skills to tackle new problems. But in many deep learning and RL methods, we train each task independently and throw away our experience as if it were not important. In this article, we focus on these two challenges: exploration and transfer learning.

Transfer Learning

We can now use a robot to peel lettuce, a task that is technically hard for a robot. However, the robot needs to perform a few dozen tasks before it becomes commercially viable. Otherwise, it has no competitive edge over specialized machines. We also want the solution to be robust enough to handle different scenarios. However, the number of possible combinations is unmanageable if the robot learns each task independently. It wouldn’t scale. To be effective, we need to learn from past experience and transfer learned knowledge from one task to another. This is the topic of transfer learning in RL.


Deep learning (DL) faces the same issue, but it is much harder for RL because most extracted features or learned policies are highly specialized for the task at hand. They are not easily transferable. Nevertheless, many DL transfer learning techniques are still very beneficial in RL. Many readers may already be familiar with transfer learning in DL, so we will go directly to a more advanced method called the progressive network.

In DL, a very expressive network risks overfitting when new data is limited. Hence, we don’t want to retrain the whole network, since that would require sampling a lot of new data.

The progressive network above is composed of a pre-trained network and a new but smaller network (the bottom one in the figure above). To avoid overfitting, the parameters in the larger network are frozen and we only train the smaller network to extract task-specific features. The generic features are provided directly from the larger network to the smaller network. Unlike in many transfer learning methods, the larger network’s parameters are never overwritten, and therefore we never forget the learned experience.
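The freezing idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the architecture from the progressive network paper: the frozen weights `FROZEN_W`, the class `SmallColumn`, the toy target `new_task`, and all sizes are made up. A fixed feature extractor plays the role of the pre-trained network, and only the small column that reads its features is updated.

```python
import math
import random

random.seed(0)

# Pre-trained (frozen) column: a fixed one-hidden-layer feature extractor.
# These weights are made-up stand-ins for a network trained on an earlier task.
FROZEN_W = [[0.5, -0.3], [0.8, 0.1], [-0.6, 0.9]]  # 3 hidden units, 2 inputs


def frozen_features(x):
    """Generic features from the frozen column; its weights are never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


class SmallColumn:
    """New, smaller column: a linear layer on the raw input plus a lateral
    connection that reads the frozen column's features."""

    def __init__(self, n_in, n_lateral):
        self.w_in = [0.0] * n_in
        self.w_lat = [0.0] * n_lateral
        self.b = 0.0

    def forward(self, x):
        h = frozen_features(x)
        y = self.b
        y += sum(w * xi for w, xi in zip(self.w_in, x))
        y += sum(w * hi for w, hi in zip(self.w_lat, h))
        return y, h

    def sgd_step(self, x, target, lr=0.05):
        """One squared-error gradient step that touches the small column only."""
        y, h = self.forward(x)
        err = y - target
        self.w_in = [w - lr * err * xi for w, xi in zip(self.w_in, x)]
        self.w_lat = [w - lr * err * hi for w, hi in zip(self.w_lat, h)]
        self.b -= lr * err
        return err * err


def new_task(x):
    """Toy regression target for the new task (an assumption for this sketch)."""
    return 2.0 * math.tanh(0.5 * x[0] - 0.3 * x[1]) + 0.5 * x[0]


data = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
frozen_before = [row[:] for row in FROZEN_W]
column = SmallColumn(n_in=2, n_lateral=3)


def mean_loss():
    return sum((column.forward(x)[0] - new_task(x)) ** 2 for x in data) / len(data)


loss_before = mean_loss()
for _ in range(50):
    for x in data:
        column.sgd_step(x, new_task(x))
loss_after = mean_loss()
```

After training, `FROZEN_W` still equals `frozen_before`, so whatever the first network learned is untouched, while the loss on the new task drops because the small column learns to reuse the frozen features.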

Overconfidence hurts

Overconfidence hurts in DL and is even worse in RL. During fine-tuning, overconfident policies lack the stochastic behavior and randomness needed for effective exploration. An overly certain policy also lacks the flexibility to handle changes in environmental conditions. At inference time, a stochastic policy can help us break out of a deadlock when conditions change.


To encourage diversity in our actions so we can handle changes, we can add an objective term that measures the entropy of the policy. The higher the entropy, the more diverse the actions.
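As a rough sketch of such an entropy term for a discrete softmax policy: the weight `beta` and the helper names below are assumptions for illustration, and real implementations differ in how they estimate the return.

```python
import math


def softmax(logits):
    """Turn action logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(expected_return, probs, beta=0.01):
    """Return plus an entropy bonus; beta is a hypothetical trade-off weight."""
    return expected_return + beta * entropy(probs)


confident = softmax([10.0, 0.0, 0.0, 0.0])  # almost always picks action 0
uniform = softmax([0.0, 0.0, 0.0, 0.0])     # picks every action equally often
```

For the same expected return, `regularized_objective` scores the uniform policy higher than the confident one, which is what nudges training away from premature certainty.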



To increase the robustness of our solution, we need to train with different configurations. For example, we should train with different widths of the walker below. Hopefully, by training on many scenarios, the trained model will generalize to widths it has not been trained on. This is the same basic principle behind generalization in DL.


In DL, we augment data so the solution generalizes better. In RL, we vary objects, environments, and goals during training to improve the robustness of the solution. For example, we include objects of different shapes or change our target location.


Let’s study another example: flying a drone indoors. We want to vary the environment with different space and object configurations, and to have different types of objects (walls, people, and furniture) in our way. However, this is not feasible in the real world because there are too many combinations. Instead, we can train our model in one source domain and hope it deploys successfully in another. In this example, the source domain is a virtual world composed of synthetic images. The simulator adds walls and corners and rearranges the furniture to simulate different environments. Once the model is trained, we deploy it to the target domain: the real world.

Modified from source

It turns out to work nicely.

Domain adaptation

However, sometimes we need to bridge the gap between the source and the target domain. With a GAN (generative adversarial network), we can convert an image from one domain into another.


We can apply the same concept to synthetic images to make them look real. As shown below, the robot arm in the middle looks much closer to the real one now.

Left: simulator images, Middle: adapted images, Right: real images. Source

Domain Randomization

The Domain Adaptation method above tries to create synthetic data that looks real. Domain Randomization adopts a different approach. It trains models on simulated low-fidelity images. We randomize the rendering with random camera positions, lighting conditions, object positions, and non-realistic textures. Our intention is not to make the images look real. With enough variation, the model generalizes to handle many variants, which hopefully include the real world. The principle is that by increasing the diversity of the source domain, we can cover a wider range of target domains.
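The randomization step can be sketched as sampling a fresh rendering configuration for every training image. Everything here is hypothetical: the `RenderConfig` fields, the value ranges, and the texture count are placeholders, and a real setup would pass each configuration to its simulator's renderer.

```python
import random
from dataclasses import dataclass


@dataclass
class RenderConfig:
    """One randomized rendering setup (field names are illustrative)."""
    camera_xyz: tuple
    light_intensity: float
    object_xy: tuple
    texture_id: int


def sample_render_config(rng, n_textures=1000):
    """Draw camera, lighting, object position, and texture at random."""
    return RenderConfig(
        camera_xyz=(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(0.5, 2.0)),
        light_intensity=rng.uniform(0.2, 3.0),
        object_xy=(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)),
        texture_id=rng.randrange(n_textures),  # textures need not look realistic
    )


rng = random.Random(0)
# Each training image would be rendered from its own freshly sampled config.
configs = [sample_render_config(rng) for _ in range(5)]
```

Seeding the generator keeps the sampled configurations reproducible, which helps when comparing training runs.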


In RL, the more diverse the source domain, the better the model will be.

Data vs. Task

Before Galileo, we had a complex geocentric model describing how the sun and planets move around the earth. The model was complex and could not make predictions beyond the movements of nearby planets. Newton unified the concept of motion, covering both the big and the small. We no longer have separate theories for astronomy and general mechanics. By putting things in the same context, we gained a much deeper understanding of physics.

In DL, we want to avoid overfitting to the data. So far, we have trained an RL system to handle diverse scenarios. But we should push further.

For RL, we want to avoid task overfitting.

In other words, DL extracts common features among data, and we want RL to discover the common patterns among tasks. We train our model on a large variety of tasks so that noisy information is filtered out and the discovered rules are more fundamental. Let’s look into some examples.

Model-based reinforcement learning

Above, we learn the model (system dynamics) of a robot arm using the multiple tasks on the left. The common pattern that we want to discover is the laws of motion that apply to the robot arm. To check the robustness of what we learn, we adopt a one-shot method on a task never seen in training, i.e. we give the robot only one attempt to complete the task. The experiment above is a toy experiment from Google and UC Berkeley that teaches robots to grasp objects. It tests how well what we learn handles unknown situations (untrained conditions).

During training, we use different tasks to fit the global model. In testing, we observe the resulting trajectory after taking each action and use it to develop a local model. Then we fine-tune our action plan by combining the global model with this local model.
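One way to picture the combination is a one-dimensional toy, sketched below under strong assumptions: the dynamics are linear, the global model is reduced to two numbers, and the local model is a simple estimate shrunk toward the global prior. The names `LocalModel`, `plan_action`, and `true_step`, and all constants, are made up for illustration. The experiment described above is far richer than this.

```python
def make_global_model(a=1.0, b=0.5):
    """Stand-in for dynamics fitted on training tasks: x_next = a*x + b*u."""
    return {"a": a, "b": b}


class LocalModel:
    """Online estimate of the control gain b, with the global model as a prior."""

    def __init__(self, global_model, prior_strength=1.0):
        self.a = global_model["a"]
        self.prior_b = global_model["b"]
        self.prior_strength = prior_strength
        self.sum_uu = 0.0
        self.sum_ur = 0.0

    def update(self, x, u, x_next):
        """Fold one observed transition into the local statistics."""
        r = x_next - self.a * x  # part of the transition explained by the action
        self.sum_uu += u * u
        self.sum_ur += u * r

    @property
    def b(self):
        # Least-squares estimate shrunk toward the global prior.
        return (self.prior_strength * self.prior_b + self.sum_ur) / (
            self.prior_strength + self.sum_uu
        )


def plan_action(model, x, goal):
    """One-step plan: choose u so the predicted next state equals the goal."""
    return (goal - model.a * x) / model.b


def true_step(x, u):
    """Unseen test task: the real gain is 0.2, not the global model's 0.5."""
    return 1.0 * x + 0.2 * u


model = LocalModel(make_global_model())
x, goal = 0.0, 1.0
errors = []
for _ in range(10):
    u = plan_action(model, x, goal)  # plan with the combined model
    x_next = true_step(x, u)         # act once and observe the outcome
    model.update(x, u, x_next)       # refine the local model online
    errors.append(abs(goal - x_next))
    x = x_next
```

In this toy, the first action overestimates the arm's response because it relies on the global model alone. Each observed transition then pulls the estimate toward the true gain, so the distance to the goal shrinks within a single attempt.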

Modified from source

Actor-mimic and policy distillation

Let’s look at another training method, this time for an AI player that plays multiple Atari games. The DQN paper trains on each Atari game separately. Can we have a single policy that plays all Atari games? Once it masters a few games, can it play a totally new game with little or no retraining?

In Space Invaders, when the aliens fire at us, we run away. If our policy is trained only on this game, it will perform badly in Pong, where we want to hit the ball rather than run away from it. Such a policy does not generalize. By training on multiple tasks, we gain better and more fundamental knowledge: we should be alert when an object is approaching, and act differently depending on the context. For example, in Pac-Man, we want to run away from the ghosts. But right after we eat a Power Pellet, we can chase the ghosts and eat them. The new policy is more versatile across different games and far more fundamental.

The following is one possible way to train a single policy from multiple tasks. We train it by minimizing the cross-entropy between each individual policy and the combined policy. In short, we want the combined policy to mimic all the individual policies closely. We do not need the individual policies to be mature before training the combined policy. Indeed, we train the combined policy and the individual policies in alternating steps. We can keep them in lockstep, improving each other, using a concept similar to Guided Policy Search.
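The cross-entropy objective can be sketched with a tiny tabular example. The teacher distributions, state names, and learning rate are invented for illustration, and the teachers are held fixed here rather than trained in alternation. A real combined policy would be a single network shared across games, not a lookup table.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(teacher_probs, student_probs):
    """H(teacher, student); it is smallest when the student matches the teacher."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs) if t > 0.0)


# Hypothetical per-game teacher policies over the actions [left, stay, right].
teachers = {
    ("space_invaders", "shot_incoming"): [0.8, 0.1, 0.1],  # dodge the shot
    ("pong", "ball_incoming"): [0.1, 0.1, 0.8],            # move toward the ball
}

# Combined (student) policy: one set of logits per (game, situation) context.
student_logits = {state: [0.0, 0.0, 0.0] for state in teachers}


def distill_step(lr=0.5):
    """One gradient step on the mean cross-entropy over all teacher states."""
    total = 0.0
    for state, t in teachers.items():
        s = softmax(student_logits[state])
        total += cross_entropy(t, s)
        # For a softmax student, d(cross-entropy)/d(logits) = s - t.
        student_logits[state] = [
            l - lr * (si - ti) for l, si, ti in zip(student_logits[state], s, t)
        ]
    return total / len(teachers)


losses = [distill_step() for _ in range(200)]
```

The loss does not fall to zero, because the cross-entropy is bounded below by the teachers' own entropy. What matters is that the student's action probabilities end up close to each teacher's in the matching context.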


We combine learned skills to tackle problems. Just as in software development, the best way to promote reusability is a modular design.

In robotic control, policies can be decomposed into task-specific and robot-specific modules. By blending them together, we can form a much richer set of combinations, and we don’t need to train each combination one by one.


After each module is trained, we can concatenate them for different configurations.

For example, if we have two robot modules and two task modules, there are four possible combinations. We can train on just three of them. If our solution is robust, it should cover the fourth combination without ever having trained on it.
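The two-by-two setup can be sketched as function composition. The modules below are hand-written stand-ins for trained networks, and the interface between them (a 2-D displacement command) is an assumption made for this example. The module names, robot types, and held-out pair are likewise hypothetical.

```python
from itertools import product


# Task modules: map an observation to a robot-agnostic command (here, a 2-D
# displacement for the end effector).
def task_reach(obs):
    return [g - p for g, p in zip(obs["goal"], obs["position"])]


def task_push(obs):
    return [0.5 * (b - p) for b, p in zip(obs["block"], obs["position"])]


# Robot modules: map the shared command to joint commands for one robot.
def robot_three_link(command):
    return [command[0], command[1], 0.0]


def robot_four_link(command):
    return [command[0], command[1], 0.0, 0.0]


def compose(task_module, robot_module):
    """Chain a task module into a robot module to form one policy."""
    def policy(obs):
        return robot_module(task_module(obs))
    return policy


TASKS = {"reach": task_reach, "push": task_push}
ROBOTS = {"three_link": robot_three_link, "four_link": robot_four_link}

all_combos = list(product(TASKS, ROBOTS))
# Pretend these three pairings were seen during training.
trained_combos = {("reach", "three_link"), ("reach", "four_link"), ("push", "three_link")}
held_out = [c for c in all_combos if c not in trained_combos]

# The fourth pairing is assembled from modules that were each trained elsewhere.
task_name, robot_name = held_out[0]
zero_shot_policy = compose(TASKS[task_name], ROBOTS[robot_name])
obs = {"position": [0.0, 0.0], "goal": [1.0, 1.0], "block": [0.4, -0.2]}
joint_command = zero_shot_policy(obs)
```

Every module appears in at least one trained pairing, so the held-out policy is built entirely from already-trained parts. Whether it actually performs well depends on how cleanly the modules separate task knowledge from robot knowledge.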