RL — Exploration with Deep Learning

Jonathan Hui
13 min readAug 9, 2021


Exploration is an art. In the high-tech world, we value bold ideas but yet we are risk-averse in reality. Releases are filled with low-hanging fruit. We keep repeating our past successes without exploring what the next success should be.

We focus on judging the ideas rather than developing them. We demand answers to all unknowns. But without exploration, those questions will never answer. Our experience is often our worst enemy. New ideas will have flaws. People didn’t want to wait for the DVD movie to come. Yet the Netflix model killed Blockbuster in 2010. In meetings, intellectual reasoning takes over quickly and kills ideas.

The real issue is we are overconfidence on what we know but fail to evaluate the potential of the unknown.

But to be fair, there are executives spending months in blue sky concepts without delivering anything. So what is the right balance? How can we balance risk and reward? How do we choose between exploitation and exploration? How can we act with incomplete information?

This should not be a binary decision. In reinforcement learning (RL), our goal is to create equations to tell us when to explore, when to exploit and how to mix them up. Let’s bring out the famous multi-armed bandit problem.

We have three slot machines with different payouts. After making some bets, here are the expected net gain (Q) for each machine.

The x-axis is the expected net gain Q. The y-axis is the corresponding probability density p(Q). The sharper the curve, the more certain we are with the estimation. Machine 1 (red) seems to be luckier and we play it more often. Hence, we are more certain about its payout. Currently, it has the highest expected gain (0.1). Every dollar we play, we gain 10¢. Many people will continue betting on machine 1 because that is what the data shows. But we fail to consider the uncertainty in the decisions.

Slot machine 3 (blue) has the lowest expected gain but also the highest uncertainty. So it can be the dark horse that we don’t aware of. We may lose money again. But we gain knowledge.

Hypothetically, if we play machine 3 (blue) more, we may even realize that it has the highest payout.

When we decide which machine to play, we can just single out the top 90th percentile of Q on each machine. We compute its average and play the one with the highest value. We continue this strategy to pick our next move.

The choice of using the 90% percentile is simple but arbitrary. For example, we can use the 75% percentile. In fact, there are many proposals under different assumptions.


Let’s define the objective more precisely. Regret is defined as the reward difference between playing the best machine and what we get from our actions.

For the multi-armed bandit problem, we reformulate the equation using the number of times that we take a specific action.

As we keep trying with proper exploration strategy, we will gain knowledge and the regret should decrease over time.

On the contrary, if we take a greedy method instead, there is no exploration. The regret will continue going up linearly. To improve the exploration, we can adopt ε-greedy with a fixed chance of exploring the alternatives. While this reduces the regret, the total regret will not flat out as the exploration will introduce non-optimized decisions. To address the issue, we can have ε decays over time as we gain knowledge in making decisions. Our exploration decreases over time.

Modified from source

Optimistic exploration with UCB1

As we don’t know which is the best machine, let’s redefine the objective again for a more practical purpose. The new objective contains the estimated expected reward Q and the estimated expected uncertainty U. The latter term acts as the exploration bonus. Intuitively, we take actions with the best combination of estimated reward and the possible upside.

We cannot explore forever for unlikely possibilities. We need to establish U's value, which should be sensitive to how often we have been trying a. We need to explore less as we gain more information. Mathematically, the chance of Q greater than an upper bound is:

Let assume p to be:


And U becomes


We should explore less as we progress. So let’s assume p should degrade as t⁻⁴. U becomes:

Modified from source

We establish the upper bound on how far we should explore at a different time step. This is the UCB1 algorithm:

We give more chance to actions that are less taken even it is less promising. This establishes as the exploration bonus. As time progress, we explore less. The upper bound of the total regret in UCB1 is:


This method is a type of optimistic exploration. We try out potentials until we are comfortable that they are not great.

Bayesian UCB

But in UCB1, we do not account for the spread in our sampled reward (the shape of the reward).

We can explore actions with the best optimistic potential in the form of

where C is some constant and σ is the standard deviation. In Bayesian UCB, we model the rewards with a Gaussian distribution.

After taking into account the Gaussian distribution, the best action takes will be:

where c is the multiplier of standard deviation that we want above the mean.

Thompson sampling (Posterior sampling)

In Thompson sampling, we build a reward distribution model for each machine continuously. In each iteration, we sample a value from each machine’s reward distribution. We select the highest value and play the corresponding machine.

In the example above, machine 2 (green) has the largest sampled value. We play machine 2 and observe the reward.

Then we refit the reward model of machine 2 with the Bayesian inference. Bayesian inference (details) uses Bayes’ Theorem to combine what we were believed (prior) and what we observed to form a new belief (posterior). The key point is we form a new reward distribution for machine 2 below.

The new reward distribution shown above has shifted left with higher certainty. The lower expected rewards decrease your chance of being selected but never rule it out. We repeat the process to refine the reward distributions.

In the beginning, all three machines can be initialized with a uniform distribution and have an equal chance of being selected. As the collected samples grow, we become more selective.

Here is the algorithm:


Information Gain

Entropy measures the amount of information. To model random numbers, we need a lot of bits (information). If a variable is either 0 or 1, we only need 1-bit to encode it. Information Gain measures how much information we obtain after making an observation. If the entropy drops significantly, we know this observation makes the end result very predictable, i.e. we gather a lot of information after taking the action.

Modified from source (z is the reward, y is the observation and a is the action)

Based on the information gain and the expected rewards of action, we choose one based on:

Modified from source

i.e. we choose high reward action balance with how much we know it.


Here are the three major exploration strategies we have discussed.

  • Optimistic exploration.
  • Thompson sampling (posterior sampling).
  • Information gain.

Next, we need to extend these concepts beyond multi-armed bandit problems.

Contextual Bandits

In the multi-armed bandit problem, the reward does not depend on the state (the context of the environment). It doesn’t depend on who is playing or the location of the machine. And whatever slot machine we play before, it does not impact the reward we collect next.

In contextual bandits, the state of the environment is important. The reward depends on the state and the action. But similar to the multi-armed bandit, the current state does not impact the payout of the next state.

For example, the advertisements pushed to a reader on a web page are sensitive to the reader. We maximize the reward according to the reader profile (the state of the agent). But similar to the multi-armed bandit, the advertisements displayed for the next HTTP request are independent of the last request.

We measure the expected reward of taking an action at a specific state. (a.k.a. Q-value function)

We use a function Φ (say a deep network) to extract features. For simplicity, we use a linear model, parameters θ, to estimate Q from these features. If we extract five features from the state, θ is simply a vector with five parameters.

We can solve θ using the equation below:

The UCB for the contextual bandits is:

Exploration in RL

Multi-armed bandits can be viewed as a simple 1-step stateless RL problem while contextual bandits is a 1-step RL problem.


But RL in general, the exploration can be far more complex. For the remaining sections, we will focus on the exploration of more complex domains.

Exploring with pseudo-counts

Recall that UCB1 is formulated as:

which the second term is the exploration bonus. Conceptually, we use the frequency of visits as an inverse indicator for the exploration bonus. However, N above does not account for the state. For contextual bandits, we reformulate the equation to take the state into account.

where B is any function suggested by optimistic exploration methods like the UCB1. i.e.


Here are other alternatives:


However, if we use the raw image as the state, N(s) will be mostly 0 as we can hardly find two images to be exactly the same.

Instead of using an accurate count, we should use a pseudo count. Intuitively, similar states should be grouped and counted together. To compute pseudo count N(S), we train two networks, one before a state is visited (θ) and one after (θ’). These models estimate the density

which equals:

Then we use the equations above to solve N(s) as:

With all collected states, we can fit both models. Here is the full algorithm for exploring using pseudo-counts.


Counting with Hash

Alternatively, we can use CNN, modeled by 𝜙, to extract features (the representation in blue). Then we use this as the index (say a hash index) for N as in N(𝜙(s)). The objective of our training is to produce the best compressor in representing the raw image.


Exemplar models

Previously, we estimate a count in measuring how often we have explored a state. Alternatively, we can model directly how “new” this state is to us. Have we due with this state often?

In the exempla model method, we train a discriminator to tell whether it is different from the rest of the data we saw. If the discriminator can tell the difference of a sample from others easily, it is novel. For a novel state, we want to increase its chance to be explored.

Modified from source

So if the discriminator D can discriminate s from the samples in all collected states, the density p(s) should be close to 0.


Here is the algorithm:


And equation 1:


Exploration by random network distillation

This method simply answers whether if a state is novel or not! We have two models and f*. f* is a target function. It is modeled with 𝜙 which is randomly initialized.

We are going to train θ to match this target function f*. “What is f*” is not important. We simply set it to be a random function reflected by its random parameters. For two similar inputs, f* and should give similar answers. Since we try to train θ to match f*, any state in which f*(s, a) is very different from fθ(s, a) will be considered a novel.


Therefore, we can use ε below as the exploration bonus.

Posterior sampling in deep RL

Previously, we apply Thompson sampling in solving the multi-armed bandit problem.


We sample p(θᵢ) for each machine and select one with the highest value. How can we apply Thompson sampling in RL? In Thompson sampling, we sampling from p(θᵢ) which models the reward. In RL, we model from Q-function instead. Here we apply the Thompson sampling concept with the Q-value function.


Bootstrapped DQN

In the ε-greedy method, we sample actions randomly sometimes to help exploration. Nevertheless, many problems require deep exploration and such random actions can destroy the training progress. Model parameters oscillate and do not converge.

As said before, overconfidence is a major hurdle in exploration. In RL, the model in the early phase and the training data itself introduce errors or noise that help exploration. As said before, overconfidence destroys exploration. Hence, to improve exploration, we can deploy techniques to avoid overconfidence.

One possibility is to train K different models with different datasets. For each episode, we randomly select one of the models to be used throughout the whole episode. So we will have a more coherent strategy for our actions rather than introducing some random actions. Since different episodes may use different models, we open ourselves to a different perspective and being less overconfident to a single perspective (a training model).

Bootstrap is a simple technique of generating a new dataset by random sampling the original dataset with replacement. In bootstrapped DQN, the concept is similar but with a different execution strategy. With each new (state, action, reward, new state) that we sample and observe during the training, we add it to a subset of these K datasets only, determined by a random policy modeled by a distribution M. This policy is not important for our discussion. We only need to know all these K datasets have overlapped data but they are not the same. As these K models are initialized randomly, we expect the trained model will give us the K different perspectives that we may hope for. Here is the Bootstrapped DQN algorithm:


But training K different models can be very expensive. Alternatively, all these models can share a common core with a different head.


Information Gain in RL

In our discussion, IG can be generalized as:

Modified from source

To improve exploration, we can take actions that lead us to the largest information gain on rewards, i.e., we choose z to be the reward. However, in RL, rewards are sparse and may change only after a long sequence of actions. So it may not be a good candidate to improve exploration in RL.

Alternatively, we can measure the change in state density before and after an action (the state before and after an action).

A major change implies the state is novel. We want to explore more.

z can also be the parameters modeling the state transition in an MDP. Actions that lead to large changes in the transition model, we want to explore more. Information gain can be expressed with KL-divergence.

In VIME, it is expressed as:

Here is the algorithm. This is solved using variational inference. The complexity is beyond our scope and please refer to the original paper for details.