RL — Transfer Learning

Jonathan Hui
8 min readApr 18, 2022

We learn from past experiences. We apply learned knowledge to solve new tasks. In Deep Learning, training a deep network takes a lot of time and samples. This sounds awfully inefficient.


In CNN networks, the first couple layers detect edges, strokes, or colors. To improve efficiency, we can train a network with ImageNet data and retrain it with images targeted for our problems. Since the network is already trained in detecting general features, this approach reduces the amount of time and samples needed in solving other problems. This approach is the poster child of transfer learning.

This kind of transfer learning can apply to NLP. In NLP, we can train a model in English and use fewer samples to train the model in handling a different language, even Chinese.

But, there are issues. The new data may override the generalized features that we learn. We lose our experience. If the training dataset for our training is small, we run into problems of overfitting and the solution will not be generalized. One common solution is retraining the last few layers only. But this will not be as optimized as end-to-end training.

Progressive networks

In a progressive net, it does not change the model parameters that are already trained. Instead, it builds a smaller model to be trained with the smaller dataset to avoid overfitting. But it feeds outputs from the trained model to the new model to enrich the feature capturing capabilities.

Maximum entropy training

In reinforcement learning RL, transfer learning is harder. The extracted features, value functions, and policies are more specialized and less transferable. The policy is optimized and lacks the stochastic behavior for explorations in transfer learning. To address that, we can add an incentive to the objective function to increase the entropy of our policy. This makes our actions more diversified and provides more choices in solving a problem. This increases the robustness of the system which may generalize better for different scenarios and other tasks.


For example, a more diversified policy allows us to solve problems better when situations are changed from the training.



One possible issue in our training is the lack of diversity to learn well.

Train the walker in different sizes. Source

So we can train the model with diversified physical parameters and hopefully it is generalized to parameters that have not been trained before. So rather than memorizing the solution, the diversified parameters force the training to learn how to solve the problem under different situations.


Real physical simulations are hard to diversify. But it is much cheaper to fabricate scenes with a computer. We can generate real-life scenery from a computer-generated semantic map with GAN.


Even better, we may be able to train from sympathetic graphics directly for a real-life problem.

Modified from source

We can randomize textures, lighting, viewing angles, and compose objects to form different scenery. We can add walls, corners, or corridors to the pictures. We can use these sympathetic data (without a real-life image) to train a real droid to fly inside a building.


In this example, it uses synthetic images for training and hopes that it is robust enough for the real world. But it can be better. In particular, synthetic image training may not work in the real world. Domain adaptation allows a model trained with samples from a source domain to generalize to a target domain. In this case, the source domain is the simulation, whereas the target is the real world. Consider this as bridging the reality gap.

Left: simulator images, Middle: adapted images, Right: real images. Source

The system takes synthetic images from the simulator and produces adapted images that look similar to the real world. Then the system is trained with the adapted images and the real-world image.

Here is the architectural design for the generator and the discriminator.

Modified from source

Multi-task transfer

Previously, we learn a task and transfer the knowledge to solve another one. We can start with a trained model and finetune it for another task. Or we can build a smaller network and make predictions with help from a bigger model trained for another task. Or we can improve the diversity of the training samples to make the solution more generalized.

Multi-task transfer learns how to solve multiple tasks. But forming a model that handles many tasks, we may discover the general patterns in solving these problems. The general patterns can be the laws of dynamics or a common approach that can solve a set of problems well.

Model-based reinforcement learning

One common example is the law of motion. Things move differently but all obey the law of Physics. So by learning the model from different tasks, we can create a model (the system dynamics) for the robot arm.

For example, we can learn the model from multiple tasks above. It adopts a one-shot training. It gives one attempt for the testing task to complete. Observing the trajectory, it combines the local model estimation with the prior belief from training to form a new model for trajectory planning.

Modified from source

Ensembles & Distillation

The ensemble method combines models together to make predictions. It is a powerful way to generalize solutions. We bet that models trained differently should not make the same mistake. Hence, the prediction can be more robust. Nevertheless, it is expensive since we need to calculate the results for all the models in the ensemble methods.

Distillation trains a single model on the ensemble’s predictions. We are not teaching this model to match the label of the training samples. Instead, we try to match it with the models’ predictions. What is the difference?

During training, we force ourselves to make a black and white decision and label it as seven. But the prediction from a model may contain richer information. For example, it may predict an 80% chance that it is a seven but also a 20% chance that it is a one. This information gives us richer information than a black-and-white decision. So when trying to build a system in handling multiple tasks, our interests are not in mimicking the labels of each task. We have a better chance to understand the basic rules by not treating it as a black or white issue.

Let’s have an example to demonstrate how can we benefit from multiple tasks training. In the space invader game, when the alien fire at us, we run away. If our policy is only trained with this game, we will perform badly in the pong game. We want to hit the ball but not run away. The policy is not generalized. By training with multiple tasks, we gain more fundamental knowledge. We should be alerted when an object is approaching. Based on the context, we act differently. For example, in the Pac-Mac game, we want to run away from the ghosts. But when we just capture a Power Pellets, we can chase the ghosts and eat them. Our new policy is more versatile to handle different games.

Following is one possibility to train a policy from multiple tasks. We train this combined policy by minimizing the cross-entropy between individual policy and the combined policy.

Modular networks

As a golden rule for software engineers, we break down designs into modules to maximize reusability. In robotic control, policies can be decomposed into task-specific and robot-specific, like, a policy for operating a specific robot and another policy for moving a specific object.


Once, it is trained, we combine them together to form new robot-task combinations

to perform operations that have not been trained before.


Here is the architectural design:


However, we may overfit our models with some specific task-robot combinations and hurt our generalization to others. Deep learning uses capacity reduction and dropout for regularization. In this example, we just need to do it in the output layer of the task module. The reduced capacity at the interface with the dropout improves its transferability to other robot modules.

Contextual policies

To finish a task, we need to know the context. Sometimes, it can be inferred from the observation. For example, when your wife stares at the dishes in the sink, you better know what to do. But sometimes, this context needs to be provided explicitly. For example, where do you want to go or what is your target.

Image source: Peng, van de Panne, Peters

Such context ω actually is needed in determining a policy. But mathematically, we can treat that like additional states.

We can use the experience as a context: