RL — Tips on Reinforcement Learning

Jonathan Hui
14 min read · Feb 6


Photo by Sam Truong Dan

Deep Learning (DL) is hard to train, and reinforcement learning (RL) is much worse. In early development, follow the same strategy as for DL: keep things simple! Remove any bells and whistles that get in your way and reduce the uncertainty to a minimum. Specifically for RL:

  • For new models and algorithms, pick simple toy experiment(s) for early development.
  • Simplify the problem first so we can run experiments easily and fast.
  • Be patient with hyperparameter tuning. RL is very sensitive to hyperparameters (worse than DL).
  • Try different random seeds.

Aim low. Always work from something that is working.

Set up reference

Many model-free algorithms have a long warmup period before showing any sign of progress. Often, they take millions of iterations before showing promising moves. To tackle the uncertainty, one effective method is to get familiar with toy experiments using other RL methods. Then we use them to cross-reference our progress. For example, do the actions or the rewards look normal at this point in the training? As an example, the plot below shows the aggregated training progress of different DQN methods on 57 Atari games. This gives us some guidelines on when to continue the training and when to start debugging when developing DQN-like methods.


Input features

Make tasks easier to solve first. If learning from raw pixels is slow, use handcrafted features first. For example, use the states gathered from the robot arms or the observed object locations instead of inferring from raw pixels. The high dimensionality of images adds significant complexity to the problems. Atari games have relatively simple game rules and work well with CNN networks in extracting generic features. That may not hold true for other RL tasks.

RL is not Exactly Deep Learning

Unfortunately, many successes in DL like supervised learning are not easily duplicated in RL. Let’s cover some of the issues.


One major difference between DL and RL is the data distribution of the training input. In DL, we randomize the input so that each batch of training data contains a good balance of different classes of objects and each sample is independent of the others. We cannot predict what we may see next from the previous samples.

We call this i.i.d. (independent and identically distributed). We want samples to be identically distributed, i.e. the data distribution for each batch of samples should be similar. Below, all the samples are from the same class. The input data distribution is strongly biased towards the object class “0”. Therefore, it is NOT i.i.d.

This sample batch is bad for supervised learning. We don’t need to learn any features. Instead, we can simply predict the output as “0” regardless of the input to reduce the training loss.

Policy-based method in RL

Samples in RL can be highly correlated within the same training batch. The space to explore is heavily dependent on the current policy. As we learn more, we change where we explore. The input data distributions across batches are therefore constantly evolving, so the training samples between batches are not identically distributed. In addition, for value-learning methods, the output target value changes as our estimates improve. RL is nowhere close to being i.i.d. This is a nightmare and creates a few significant challenges for RL:

  • The batch normalization and the dropout method may not work for RL.
  • It is hard to adjust the learning rate for proper convergence. Hence, we need more advanced optimizers like AdamW or RMSProp. For Policy gradient methods, look into methods like PPO that take advantage of the trust region.
  • We need to slow down the changes in input and output to give the model a chance to learn and evolve.

Overfitting Data vs. Overfitting Task

In DL, we use regularization to avoid overfitting the data. For RL, we need to think at a higher level.

First, we need to train the system with many scenarios. For example, to train a droid to fly indoors, we should create as many rooms and furniture configurations as possible. To maximize such scenarios, we may need to create synthetic data just for training purposes.


Diversity always helps! The following video trains several types of objects with a diverse set of terrains and obstacles. As the environments get more diverse, we might expect it to become impossible to train the model well. Instead, the model avoids overfitting a particular task and starts learning the complex behavior underlying all these scenarios. So complex environments enhance, rather than deter, the training. But as discussed before, we should keep things absolutely simple in early development. Suggestions like this should be applied only after the code is fully debugged.

Second, train the models with different tasks.

For the Atari Space Invaders game, when the aliens fire at us, we run away. If we train only on this game, our solution will not generalize well. For example, in Pong, we want to hit the ball rather than run away. By training with multiple tasks, we gain better and more fundamental knowledge. We should be alert when an object is approaching. But based on the context, we act differently. For example, in Pac-Man, we want to run away from the ghosts. But right after we capture a Power Pellet, we chase the ghosts and eat them.

Don’t overfit the task. Train with many tasks to gain a fundamental understanding of how things work.

The DQN paper uses an ε-greedy policy with ε = 0.05 even during testing to avoid overfitting.
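As a minimal sketch of what ε-greedy action selection looks like, assuming a small discrete action space: the function name `epsilon_greedy` and the Q-values below are hypothetical and only for illustration. With ε = 0.05, roughly one action in twenty is drawn at random and the rest follow the greedy choice.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # hypothetical Q-values for three actions

# Count how often the greedy action (index 1) is chosen with epsilon = 0.05.
actions = [epsilon_greedy(q_values, 0.05, rng) for _ in range(10_000)]
greedy_fraction = actions.count(1) / len(actions)
print(f"greedy action chosen {greedy_fraction:.1%} of the time")
```

The greedy fraction lands slightly above 95% because a random draw can also pick the greedy action.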

In general, we can overfit a task for an environment. However, without verifying it with other scenarios and tasks, the highly tuned solution is unlikely to work in other situations. So do not commit to extensive hyperparameter tuning without such verification first. Aim for a design that is less sensitive to hyperparameters. Solutions with super-sensitive hyperparameters usually do not generalize well.

Bigger is not necessarily better

We cannot blindly increase the capacity of the deep network because it risks overfitting. Solutions in handling overfitting and exploding gradient problems in DL may not be applicable in RL.

Model-based RL takes fewer samples to train and is particularly vulnerable to overfitting. Therefore, those models need to be far simpler. Unfortunately, this limits the expressiveness of the model and raises the chance of sub-optimal solutions.

In RL, the bottleneck is often in sample efficiency, stability, and convergence. Hence, designing a powerful and expressive network may take a second priority. This is particularly true if you do not have the patience to tune the model.

Local Optima

RL suffers from local optima much more than DL. In the video below, once the half cheetah is stuck in the upside-down position, it never flips back upright, even though it could run faster that way.

First, we can try different random seeds. Different random seeds may reach different local optima. We can get very different performance results by simply changing the random seeds. As shown below, the difference in performance comes from the random seeds only. Don’t underestimate its impact! Many runs can have low rewards just because of the random seeds.

Averaged over two sets of 5 different random seeds. Source

Always test algorithms or models over multiple tasks with different random seeds.
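A small harness makes this habit cheap. The sketch below assumes a hypothetical `run_experiment(seed)` that stands in for a full training run and returns a final score; here it only simulates seed-dependent noise. The point is to report the spread across seeds rather than a single number.

```python
import random
import statistics

def run_experiment(seed):
    """Hypothetical stand-in for a full training run; returns a final score."""
    rng = random.Random(seed)
    # Simulate a result that depends heavily on the seed.
    return 100.0 + rng.gauss(0.0, 30.0)

def evaluate_over_seeds(seeds):
    """Run the same configuration once per seed and summarize the spread."""
    scores = [run_experiment(s) for s in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }

summary = evaluate_over_seeds(range(5))
print(summary)
```

In a real project the same loop would wrap each task as well, so every algorithm change is judged on several tasks and several seeds.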

Second, try a better exploration scheme during training. The half-cheetah problem indicates our exploration is too short-sighted. Increase the chance of exploration versus exploitation.

Third, encourage the diversity of actions such that we may break out from the local optima. For example, add an incentive in the objective function to encourage a higher entropy for the actions.


A higher-entropy policy also helps us adapt to environmental changes better or break out of gridlock.
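One common way to write such an incentive is to subtract a scaled entropy term from the policy-gradient loss. The sketch below assumes a softmax policy over discrete actions; the helper names and the coefficient `beta` are assumptions for illustration, not values from any particular paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def policy_loss_with_entropy_bonus(logits, action, advantage, beta):
    """Policy-gradient loss for one sample minus beta times the policy entropy.

    Minimizing this loss raises the log-probability of advantageous actions
    while rewarding policies that keep their action distribution spread out.
    """
    probs = softmax(logits)
    pg_loss = -math.log(probs[action]) * advantage
    return pg_loss - beta * entropy(probs)

uniform_entropy = entropy(softmax([0.0, 0.0, 0.0, 0.0]))
peaked_entropy = entropy(softmax([5.0, 0.0, 0.0, 0.0]))
print(uniform_entropy, peaked_entropy)
```

The uniform policy has the largest possible entropy for four actions, so the bonus pushes against collapsing onto a single action too early.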


Hyperparameter tuning

The convergence of many RL methods is far worse than DL. In some RL methods, like value-learning with a deep network approximator, the training can be unstable. To address these shortcomings, we add new incentives or penalties to the objective function.

In general, we need more patience in RL than in DL. Often, RL methods work only within a narrow range of hyperparameters and require an extensive search to locate them. This is why, as mentioned before, setting up some reference points is important. For hyperparameter search, a random layout can be used (i.e. sampling parameters randomly), in particular for high-dimensional spaces.
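A minimal sketch of random search follows. The parameter names, their ranges, and the `score` function are all illustrative assumptions; in practice `score` would train and evaluate an agent with the given configuration.

```python
import math
import random

def sample_config(rng):
    """Draw one configuration; the ranges here are illustrative assumptions."""
    return {
        # Log-uniform sampling covers several orders of magnitude evenly.
        "learning_rate": 10 ** rng.uniform(-5, -2),
        "gamma": rng.choice([0.95, 0.98, 0.99, 0.995]),
        "batch_size": rng.choice([32, 64, 128, 256]),
    }

def score(config):
    """Hypothetical stand-in for training and evaluating with `config`."""
    # Pretend the best learning rate is near 10^-3.5.
    return -abs(math.log10(config["learning_rate"]) + 3.5)

def random_search(n_trials, seed=0):
    """Sample n_trials configurations and return the best-scoring one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return max(trials, key=score)

best = random_search(50)
print(best)
```

Because every trial is independent, the trials can run in parallel, which fits well with automated benchmarking.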


Don’t be overconfident or pessimistic over a single task result unless the improvement is unusual. It is hard to find a single RL method that works well across all tasks. DQN is good at the Atari games but performs badly on continuous control tasks. If a method does not work well on one task, it does not imply it will fail on others, or vice versa.

Start with simple toy experiments. Switch to others if there is no progress. Afterward, experiment with moderate-size problems. Construct experiments to prove what the algorithms are good at and to analyze the weakness.

Previously, we suggested tuning the hyperparameters patiently. This always creates a dilemma: should we tune the model further or try something new? A better approach is to automate the benchmarking process so results can be tested and verified in parallel.

Continuously benchmark your algorithm across different tasks.

Reshape reward function

Reformulate the reward function so it gives continual and better intermediate feedback. For example, instead of giving rewards only when an object reaches a target, establish finer-grained goals, such as giving rewards as the gripper gets closer to the ball. This gives more learning signals to help the training.
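To illustrate the gripper example, here is a sketch that contrasts a sparse reward with a shaped one. The function names, the `tolerance`, and the `distance_weight` are hypothetical choices, and positions are assumed to be 3D coordinates.

```python
import math

def sparse_reward(gripper, ball, tolerance=0.05):
    """Reward only when the gripper is within `tolerance` of the ball."""
    return 1.0 if math.dist(gripper, ball) <= tolerance else 0.0

def shaped_reward(gripper, ball, tolerance=0.05, distance_weight=0.1):
    """Sparse reward plus a penalty that shrinks as the gripper approaches the ball."""
    distance = math.dist(gripper, ball)
    return sparse_reward(gripper, ball, tolerance) - distance_weight * distance

ball = (0.0, 0.0, 0.0)
for gripper in [(1.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.0, 0.0)]:
    print(gripper, sparse_reward(gripper, ball), shaped_reward(gripper, ball))
```

The sparse version returns the same value for the first two positions, while the shaped version already distinguishes them, which is the extra learning signal.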

But reshaping the reward function may lead to a sub-optimal solution. In the video below, instead of finishing the course to get a grand prize, the agent loops forever collecting the Turbo rewards. So care must be taken! Just like in deep learning, what the system learns can be a big surprise. So take time to evaluate what it tries to achieve under the new reward function.

Feasibility study

When we downsample the images or decrease the sampling frequency, information will be lost. We should look at the images ourselves to ensure they retain enough information to solve the problem. If we cannot solve it, the RL methods may not be able to either. Also, run a random policy on the problem and check whether some desired pattern of behavior shows up once in a while.

Data preprocessing

Similar to DL, we want input features to be zero-centered. Apply

clip((x − μ) / σ, −c, c)

with some clipping bound c to clip the outliers and standardize the input. Use a running estimate for μ and σ with all data seen so far. Do not just use a single batch of samples to calculate them. Samples in the same batch are highly correlated in RL, so statistics from a single batch do not represent the true mean and standard deviation. We may want to standardize the prediction also. However, for rewards, just rescale them and do not change the mean.
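A running estimate can be kept with Welford's online algorithm. The class below is a minimal sketch for scalar inputs; the class name, the clipping bound of 10, and the `eps` guard are assumptions for illustration.

```python
import math

class RunningNormalizer:
    """Tracks a running mean and standard deviation over all data seen so far."""

    def __init__(self, clip=10.0, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.clip = clip
        self.eps = eps

    def update(self, x):
        """Welford's online update for the mean and the squared deviations."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there is enough data for an estimate.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x):
        """Standardize an input, then clip outliers to [-clip, clip]."""
        z = (x - self.mean) / (self.std + self.eps)
        return max(-self.clip, min(self.clip, z))

    def scale_only(self, r):
        """For rewards: divide by the standard deviation but keep the mean."""
        return r / (self.std + self.eps)

norm = RunningNormalizer()
for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    norm.update(x)
print(norm.mean, norm.std, norm.normalize(1000.0))
```

A separate instance fed with rewards would be used for `scale_only`, so that reward statistics and observation statistics are not mixed.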


As in DL, people have a tendency to act before gathering enough information. It takes a long time to verify a guess and, in my experience, many wrong conclusions are drawn. Creating a well-controlled environment for a fair comparison is not easy. For instance, we will get different results from different random seeds, so some observations can easily mislead us. Make educated guesses based on data. Always reproduce and verify your findings first.

In RL, we should monitor (when it is applicable):

  • value functions,
  • average rewards,
  • state visitation and distribution,
  • policy distribution,
  • norms of gradients,
  • size of the parameter updates,
  • changes in target values,
  • policy entropy, and
  • KL-divergence of new and current policy.

Visualize input features, outputs, and rewards in histograms. Verify that the scales are right and that the values are centered properly. Identify and remove any outliers.

Visualize data and metrics over time. Plot histograms of the data collected.

Identify any misbehavior like parameter oscillations, exploding/vanishing gradients, and non-smooth value functions. Check whether the value function (if applicable) predicts the real rewards well. Compare the results with baseline methods like Policy Gradients and Q-learning (like those in the OpenAI Baselines repository).

Monitor training progress

Monitor the min/mean/max/standard deviation of the episode return. A large variance late in training implies the training is not stable. Episode length is another way to measure progress. Even though you may lose every time, a longer episode implies you are making progress.
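These statistics are cheap to compute per logging interval. The helper below is a minimal sketch; the function name and the sample numbers are hypothetical.

```python
import statistics

def summarize_episodes(episode_returns, episode_lengths):
    """Summarize a window of finished episodes for logging."""
    return {
        "return_min": min(episode_returns),
        "return_mean": statistics.mean(episode_returns),
        "return_max": max(episode_returns),
        "return_std": statistics.pstdev(episode_returns),
        "length_mean": statistics.mean(episode_lengths),
    }

# Hypothetical returns and lengths from the last three episodes.
stats = summarize_episodes([1.0, 2.0, 3.0], [120, 150, 180])
print(stats)
```

Logging the same dictionary at fixed intervals makes it easy to plot each value over time and spot a growing standard deviation.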


Next, we discuss how to tune some parameters specifically.

Batch size/Buffer size

RL often uses a bigger batch size than DL. Large variance destabilizes training, so we use large batch sizes to smooth out the variance. Increase the batch size if there is no progress in learning. The batch size is 100K for TRPO on Atari games.


For DQN on Atari, the target network is synchronized with the current network every 10K updates and the replay buffer holds 1M transition frames. If the batch size or the buffer size is not large enough, the noise will overwhelm the training. A bigger batch size will improve the performance, but it slows down the computation.
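The two mechanisms can be sketched in a few lines. The class and function names below are hypothetical, the transitions are placeholder integers, and the demo uses a tiny capacity so the eviction behavior is visible.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer that drops the oldest transition when full."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng):
        """Draw a batch uniformly at random, without replacement."""
        return rng.sample(self.storage, batch_size)

def should_sync_target(update_count, period=10_000):
    """True when the target network should be copied from the current network."""
    return update_count % period == 0

buffer = ReplayBuffer(capacity=3)
for transition in range(5):  # placeholder transitions 0..4
    buffer.add(transition)

batch = buffer.sample(2, random.Random(0))
print(list(buffer.storage), batch)
```

Sampling uniformly from a large buffer mixes old and new experience, which reduces the correlation between samples in a batch.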

Discount factor γ

γ needs to be large enough to collect rewards. For example, if γ=0.99, rewards more than about 100 timesteps away contribute little, since the effective horizon is roughly 1/(1−γ). But if the rewards are given more frequently, γ can be lower.
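The rule of thumb behind this example is that the discount weights sum to 1/(1−γ), which is often read as an effective horizon. A reward 100 steps away is not ignored outright with γ=0.99; it is weighted by 0.99^100, which is about 0.37. The short sketch below, with hypothetical helper names, prints both quantities for a few values of γ.

```python
def effective_horizon(gamma):
    """Sum of the discount weights 1 + gamma + gamma^2 + ..., i.e. 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)

def discount_weight(gamma, steps):
    """Weight applied to a reward received `steps` timesteps in the future."""
    return gamma ** steps

for gamma in (0.9, 0.99, 0.999):
    horizon = effective_horizon(gamma)
    weight = discount_weight(gamma, 100)
    print(f"gamma={gamma}: horizon~{horizon:.0f}, weight at 100 steps={weight:.4f}")
```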

If TD(λ) is used in calculating returns, we can use a higher value for γ. The λ value blends TD with Monte Carlo to reduce variance. Monte Carlo has no bias but has high variance. On the other hand, TD has a high bias but low variance. As λ decreases from one to zero, we move from Monte Carlo towards TD. In practice, we want mainly the Monte Carlo result with some minor help from TD. In this paper, one of the toy experiments achieves the best performance when γ is 0.98 and λ is 0.96.
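The blending can be made concrete with the recursive form of the λ-return, G_t = r_t + γ[(1−λ)V(s_{t+1}) + λG_{t+1}]. The sketch below assumes a single trajectory with value estimates for every state, including one bootstrap value for the state after the last reward; the function name and the numbers are illustrative.

```python
def lambda_returns(rewards, values, gamma, lam):
    """Compute lambda-returns backwards through one trajectory.

    `values` holds one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final reward
    (0.0 if that state is terminal).
    """
    assert len(values) == len(rewards) + 1
    returns = [0.0] * len(rewards)
    next_return = values[-1]
    for t in reversed(range(len(rewards))):
        # Blend the one-step TD target with the longer return.
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]  # hypothetical value estimates, terminal state last

td_targets = lambda_returns(rewards, values, gamma=0.9, lam=0.0)   # pure TD(0)
mc_returns = lambda_returns(rewards, values, gamma=0.9, lam=1.0)   # pure Monte Carlo
blended = lambda_returns(rewards, values, gamma=0.9, lam=0.96)
print(td_targets, mc_returns, blended)
```

With λ=0 each target depends only on the next value estimate, and with λ=1 the value estimates drop out except for the final bootstrap.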

Action frequency

We do not need to change actions for every video frame. We can skip frames before taking the next actions. But we need to verify its impact with a real player first. If the human has a tough time skipping so many frames, the program will likely have similar hardships.

Skipping frames actually increases exploration. With skip frames, we are not following the script (policy) every time. We explore more as we skip more frames. We want to adjust this value to see how the exploration may do.


Many design modifications have similar effects and become redundant. For example, many methods have the effect of normalizing the input or making the optimization more stable. Remove them one by one to simplify the design if it shows no performance degradation. Simple design will generalize better for other tasks.

Tips on Policy Gradient training


Monitor the policy entropy closely.

Entropy is a measure of randomness. High entropy means high randomness. If the entropy is low, the policy is nearly deterministic and there is little exploration, which is bad for early training. The entropy should not be too low at the beginning or too high at the end. If the training collapses to a deterministic policy prematurely, add an entropy bonus to the objective to encourage exploration. We can also restrict the entropy drop (likely through the trust region) to avoid aggressive changes in the policy.


Aggressive policy changes increase the chance of bad decisions. Monitor the KL-divergence between the old and the new policy closely. Compare the values with toy experiments using established methods. A KL-divergence of 0.01 is reasonable and 10 is abnormal. For a large KL-divergence or large spikes, introduce a larger KL-divergence penalty.

Monitor the KL-divergence.
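For discrete actions, the KL-divergence at one state can be computed directly from the two action distributions. The sketch below uses hypothetical probabilities and a hypothetical helper that averages over visited states; it assumes the new policy gives nonzero probability wherever the old one does.

```python
import math

def kl_divergence(p_old, p_new):
    """KL(p_old || p_new) in nats for two discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0.0)

def mean_kl(old_policies, new_policies):
    """Average the per-state KL over a batch of visited states."""
    kls = [kl_divergence(p, q) for p, q in zip(old_policies, new_policies)]
    return sum(kls) / len(kls)

# Hypothetical action distributions at two states, before and after an update.
old = [[0.5, 0.5], [0.7, 0.3]]
small_step = [[0.52, 0.48], [0.69, 0.31]]
large_step = [[0.9, 0.1], [0.1, 0.9]]

print(mean_kl(old, small_step), mean_kl(old, large_step))
```

Logging this average after every policy update makes spikes easy to spot.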

Explained variance

In many situations, we want predictions that have not only the same mean as the ground truth but also the same variance. To measure that, we introduce a metric called the explained variance.

The explained variance is defined as:

explained variance = 1 − Var[empirical return − predicted value] / Var[empirical return]

Say the expected return (empirical return) is zero-centered with a variance of one. If a model constantly predicts zero for any situation, the explained variance above is zero. For a poorly performing model, the explained variance can be negative. On the contrary, if our prediction is spot on, the explained variance will be one in our example. So monitor it closely.

Explained variance measures how good our value-function estimation is.
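A minimal sketch of the computation, using the common definition of one minus the variance of the residuals divided by the variance of the empirical returns. The function name and sample returns are hypothetical. The three calls reproduce the cases described above: a constant prediction, a perfect prediction, and a prediction that is worse than a constant.

```python
import statistics

def explained_variance(predicted, empirical):
    """1 - Var[empirical - predicted] / Var[empirical]."""
    var_y = statistics.pvariance(empirical)
    if var_y == 0.0:
        return float("nan")  # undefined when the returns never vary
    residuals = [y - p for y, p in zip(empirical, predicted)]
    return 1.0 - statistics.pvariance(residuals) / var_y

returns = [-1.0, 0.0, 1.0]  # hypothetical zero-centered empirical returns

ev_constant = explained_variance([0.0, 0.0, 0.0], returns)
ev_perfect = explained_variance(returns, returns)
ev_inverted = explained_variance([1.0, 0.0, -1.0], returns)
print(ev_constant, ev_perfect, ev_inverted)
```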

Policy Initialization

Parameter initialization for the policy model is even more important than in supervised learning. It determines how we explore the environment. AlphaGo used supervised learning to pre-train the policy first. Even though this step was later dropped in AlphaGo Zero, it shows how a head start may help us move a project forward. Applying past experience, for example through transfer learning, may take a lot of unknowns out of project development. This can help a lot.

But when such help is not available, the final layer output for the policy should be zero or very close to zero. It maximizes the entropy and the exploration rather than having a preference on what actions to take.
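To see why a zero final layer maximizes entropy, consider a softmax policy head. With all final-layer weights and biases at zero, every logit is zero regardless of the input, so the action distribution is uniform. The sketch below assumes a discrete action space and uses hypothetical sizes and feature values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, features):
    """Final policy layer: one logit per action."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]

n_actions, n_features = 4, 3
weights = [[0.0] * n_features for _ in range(n_actions)]  # zero-initialized
bias = [0.0] * n_actions

features = [0.3, -1.2, 2.0]  # hypothetical output of the earlier layers
probs = softmax(linear(weights, bias, features))
policy_entropy = -sum(p * math.log(p) for p in probs)
print(probs, policy_entropy)
```

Small nonzero weights give a nearly uniform distribution, which keeps the same benefit while letting gradients differentiate the actions.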

Tips on Q-learning

For Q-learning, tune the following areas:

  • Experience replay memory buffer size: Q-learning can use a large experience replay buffer to stabilize the training. The DQN paper stores the last 1M video frames for the Atari games. It is worth experimenting with the buffer size, but watch out for the total memory consumption.
  • Learning rate schedule (how the learning rate is decayed over time).
  • Exploration schedule (e.g. the ε in the ε-greedy method). Start with high exploration and reduce it gradually.
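As one example of an exploration schedule, the sketch below decays ε linearly and then holds it constant. The function name and the start, end, and decay-length values are assumptions for illustration and should be tuned per task.

```python
def linear_epsilon(step, start=1.0, end=0.1, decay_steps=1_000_000):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return start + fraction * (end - start)

for step in (0, 250_000, 500_000, 1_000_000, 2_000_000):
    print(step, round(linear_epsilon(step), 3))
```

A learning rate schedule can follow the same pattern with different endpoints.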

Q-learning needs more patience than Policy Gradient methods. DQN converges slowly, with a very long warmup period at the beginning that shows no sign of progress. Test the implementation on simpler tasks first to prove the code is working. On Atari games, it takes 10–40M frames before finding a policy that looks better than random actions. In addition, value-learning methods have no guarantee of convergence when a deep network approximator is used. They tend to be more sensitive to hyperparameters, and extensive searches are often required.

Credits and reference

John Schulman’s lecture on “The Nuts and Bolts of Deep RL Research”

UC Berkeley Deep reinforcement learning

UC Berkeley Deep RL Bootcamp

David Silver UCL course in RL

Book resource

Sutton & Barto, Reinforcement Learning: An Introduction

Dimitri Bertsekas, Dynamic Programming and Optimal Control

Martin Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming