RL — Deep Reinforcement Learning (Learn effectively like a human)

Jonathan Hui
8 min read · Oct 9, 2018


Photo by pan xiaozhen

Alan Turing said

Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child’s?

With the brute force of GPUs and a better understanding of AI, we beat the Go champions, and Face ID comes with every new iPhone. But in the robotic world, training a robot to peel lettuce makes the news. Even with an unfair advantage in computation speed, a computer still cannot manage tasks that we take for granted. The dilemma is that AI does not learn as effectively as humans do. We may be just a couple of papers away from another breakthrough, or we may need to learn more effectively. In this article, we start a new line of conversation that addresses these shortcomings. We will also look into major research areas and the challenges that RL is facing.

Imitation learning

Children imitate. Imitation plays a major role in learning. In many RL methods, we analyze how decisions change the rewards we collect. This can be done by understanding the system dynamics better or through smart trial-and-error to figure out which decisions give better rewards. However, with the success of supervised learning in deep learning (DL), we can skip both and train a policy that imitates experts’ decisions directly.

Unlike other reinforcement learning (RL) methods, we don’t waste time finding what is promising. We use the demonstrations as our guidance in searching for the solution.

Expert demonstration
Imitation by supervised learning
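To make the idea concrete, here is a minimal sketch of imitation by supervised learning. Everything in it is an assumption for illustration: the 1-D state, the hypothetical expert rule `action = -0.5 * state`, and the linear policy fitted with least squares. A real system would use a deep network and recorded demonstrations instead.

```python
import numpy as np

def collect_expert_data(n_samples, rng):
    """Hypothetical expert that steers a 1-D state back toward zero."""
    states = rng.uniform(-1.0, 1.0, size=(n_samples, 1))
    actions = -0.5 * states  # the expert's rule, unknown to the learner
    return states, actions

def fit_policy(states, actions):
    """Supervised learning step: least-squares fit of action = state @ w."""
    w, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return w

rng = np.random.default_rng(0)
states, actions = collect_expert_data(200, rng)
w = fit_policy(states, actions)  # learned policy weight, shape (1, 1)
```

The learner never sees a reward. It only regresses the expert's actions on the states the expert visited.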


We never duplicate things exactly. Errors accumulate fast and put us into situations for which we have no expert samples to follow.


Humans take corrective actions when they drift off course, but imitation learning only learns from its training samples. To address that, we can collect extra samples for those off-course situations: we deploy the solution, check what is missing, and go back to the experts to label the correct actions. Alternatively, we purposely add small noise to our actions during training and observe how the experts react. In addition, for some specific tasks, we can hardcode solutions for known issues. We just need to identify them during training.
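The deploy, check, and relabel loop can be sketched as follows. This is only an illustration under stated assumptions: the 1-D dynamics `x = x + a`, the hypothetical `expert_action` rule, the linear policy, and the noise level are all made up for the example.

```python
import numpy as np

def expert_action(x):
    # Hypothetical expert: push the state toward zero with a bounded step.
    return -np.clip(x, -1.0, 1.0)

def fit_policy(xs, acts):
    # Linear policy with bias, action = w * x + b, fitted by least squares.
    A = np.stack([xs, np.ones_like(xs)], axis=1)
    coef, *_ = np.linalg.lstsq(A, acts, rcond=None)
    return coef

def rollout(coef, x0, steps, rng, noise=0.1):
    # Deploy the learned policy with small action noise; record visited states.
    xs, x = [], x0
    for _ in range(steps):
        xs.append(x)
        a = coef[0] * x + coef[1] + rng.normal(0.0, noise)
        x = x + a
    return np.array(xs)

rng = np.random.default_rng(0)
xs = rng.uniform(-0.2, 0.2, size=20)   # initial demonstrations cover a narrow range
acts = expert_action(xs)
sizes = [len(xs)]
for _ in range(3):
    coef = fit_policy(xs, acts)
    visited = rollout(coef, x0=3.0, steps=10, rng=rng)      # includes off-course states
    xs = np.concatenate([xs, visited])
    acts = np.concatenate([acts, expert_action(visited)])   # expert relabels them
    sizes.append(len(xs))
coef = fit_policy(xs, acts)  # final policy trained on the aggregated data
```

Each round adds states the learned policy actually visits, so the training set gradually covers the situations the original demonstrations missed.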

Man vs. Machine

Using human demonstrations is expensive, and we need frequent expert involvement to fill the holes. There are situations in which the computer can plan a better course of action. By sampling local data, the computer gains extra information that helps it define the local model and the problem better. It can generate local decisions that may be even better than a human’s. But these solutions do not generalize well, are vulnerable to changes in conditions, and do not provide consistent results. In addition, local decisions can be biased, and the accumulated errors hurt. Our fix may lie in deep learning (DL). With proper training, DL is good at extracting common patterns and eliminating noisy information. If we plan the training strategy well, we can get a good policy by imitating controls planned by the computer. Even if an individual sampled action is specialized or imperfect, DL can find the common pattern in solving those problems.


One of the strategies depends heavily on self-training. With no human interaction, collecting a large number of samples becomes economically feasible. With this huge volume of samples, we can discover the fundamental rules for performing tasks. When presented with target goals, we use this knowledge with planning to complete them.

During training, we may optionally present a goal for the robot arms to achieve. This goal is not necessarily similar to our final goal, which may be too hard and require human involvement to complete. Such involvement slows self-training and reduces the amount of data collected. Therefore, we may give the self-training arms a much easier goal, or simply let them try out pre-programmed semi-random actions. Intuitively, if children can learn enough basic skills, they can utilize them through planning to solve complex problems.

Another strategy is to train the robot with a minimal number of expert demonstrations. This jumpstarts the initial policy search so we do not wander in the wild for too long. But more importantly, this produces expert demonstrations which we can use to develop the model in model-based learning or the reward functions in inverse reinforcement learning.

Inverse Reinforcement Learning

Setting goals is important for any project. If a goal is too visionary, no one knows how to achieve it. If it is too narrow, we may not get the big picture right. Let’s use the game of Go as an example. In reinforcement learning, we use the final game result as the only reward. It is awfully hard to untangle this information to see which sequence of actions benefits us. These infrequent and long-delayed rewards hurt decision making. Go champions set up intermediate board positions to achieve. Not only in reinforcement learning but also in real life, success depends on how well we divide our objectives to measure progress correctly.

Technically speaking, this means the shape of the reward function matters a lot. Consider the two cost functions below. The left one gives no direction on where to search: except when we are almost at the optimal point, no movement changes the cost. In this scenario, no optimization method will do better than a random search.

The cost function on the right is smooth, without vanishing or exploding gradients. It guides us well in searching for the optimum. In many RL methods, we take the rewards as is without questioning whether they guide us well. We work crazy hard to find the model or to fit the policy with this far-fetched objective. Alternatively, we handcraft features to calculate customized reward functions. However, this solution does not scale. Likely, after many serious attempts, the reward solution is still not broad enough to model complex problems.
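The contrast between the two cost shapes can be checked numerically. The sketch below uses two made-up 1-D costs, a flat "needle" cost and a quadratic bowl. These are illustrative stand-ins, not the functions from the figure.

```python
import numpy as np

def needle_cost(x, width=0.01):
    # Flat everywhere except in a tiny window around the optimum at x = 0.
    return np.where(np.abs(x) < width, 0.0, 1.0)

def smooth_cost(x):
    # Quadratic bowl: the slope always points back toward x = 0.
    return x ** 2

def numerical_gradient(f, x, eps=1e-4):
    # Central finite difference.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = np.array([-2.0, -0.5, 0.5, 2.0])
flat_grad = numerical_gradient(needle_cost, x)    # zero at every test point
smooth_grad = numerical_gradient(smooth_cost, x)  # approximately 2 * x
```

Away from the optimum, the needle cost gives a zero gradient, so a gradient-based search has nothing to follow. The smooth cost always tells us which way to move.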

Our solution may lie in DL again. We can use it to learn the reward functions through expert demonstrations. We hope that the deep network can capture the complex rules better.

In Inverse RL, we use rewards to score the likelihood of a sequence of actions. The probability of a sequence of actions is defined as:
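One common way to write this, assuming the maximum-entropy formulation of inverse RL, is the following. Here τ denotes a trajectory (a sequence of states and actions), r_ψ is the reward function parameterized by ψ, and Z is the normalizing denominator. The notation is our assumption.

```latex
p(\tau) = \frac{\exp\big(r_\psi(\tau)\big)}{Z},
\qquad
Z = \int \exp\big(r_\psi(\tau)\big)\, d\tau
```

Z adds up the exponentiated rewards of all possible trajectories, so p(τ) is a proper probability.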

The higher the reward, the more likely the decision becomes. To model the reward function, we train the deep network below to predict it. The training objective is to maximize the likelihood of the expert demonstrations.

But computing the likelihood score of all trajectories in the denominator is very hard.

Fortunately, most trajectories have negligible rewards, so the denominator can be approximated using just the most promising trajectories.
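A small numeric sketch shows why this approximation can work. We assume the denominator is a sum of exp(reward) over trajectories, and we make up rewards for 1,000 hypothetical trajectories in which only ten are promising.

```python
import numpy as np

# Hypothetical rewards: 990 poor trajectories and 10 promising ones.
rewards = np.concatenate([np.full(990, -20.0), np.linspace(-1.0, 0.0, 10)])

# Full denominator: sum of exp(reward) over every trajectory.
z_full = np.sum(np.exp(rewards))

# Approximation: keep only the k highest-reward trajectories.
k = 10
top_rewards = np.sort(rewards)[-k:]
z_top = np.sum(np.exp(top_rewards))

ratio = z_top / z_full  # fraction of the denominator captured by the top k
```

Because the reward is exponentiated, the low-reward trajectories contribute almost nothing, and the ten promising ones account for nearly all of the sum.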

Let’s see how we train a policy and a reward function in alternating steps.

Given a reward function (top left above), we can refine a policy using a policy gradient method. Then we use the new policy to generate new trajectories and use them to approximate the denominator better. Next, we compute the gradient of the expert demonstration likelihood.

With this reward gradient, we update the reward function parameterized by ψ to increase the likelihood of the expert demonstrations using gradient ascent. We run this process iteratively to improve the reward model and the policy in alternating steps. In short, with a better reward function, we get a better policy. With a better policy, we compute a more accurate gradient to improve the reward function.
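The reward update alone can be sketched as follows, under several simplifying assumptions of ours. The reward is linear in hypothetical trajectory features, r_ψ(τ) = ψ · f(τ), instead of a deep network. A fixed set of sampled trajectories approximates the denominator. The policy update step is left out. With these assumptions, the gradient is the mean expert feature vector minus the reward-weighted mean of the sampled features.

```python
import numpy as np

def reward(psi, feats):
    # Linear reward model (an assumption): one score per trajectory.
    return feats @ psi

def log_likelihood(psi, expert_feats, sample_feats):
    # Mean expert reward minus log of the sampled denominator (log-sum-exp).
    r_s = reward(psi, sample_feats)
    log_z = np.log(np.sum(np.exp(r_s - r_s.max()))) + r_s.max()
    return reward(psi, expert_feats).mean() - log_z

def reward_gradient(psi, expert_feats, sample_feats):
    # Gradient of log_likelihood with respect to psi.
    r_s = reward(psi, sample_feats)
    w = np.exp(r_s - r_s.max())
    w /= w.sum()  # softmax weights over the sampled trajectories
    return expert_feats.mean(axis=0) - w @ sample_feats

rng = np.random.default_rng(0)
expert_feats = rng.normal(1.0, 0.5, size=(20, 3))   # hypothetical expert trajectories
sample_feats = rng.normal(0.0, 1.0, size=(200, 3))  # hypothetical policy samples
psi = np.zeros(3)
before = log_likelihood(psi, expert_feats, sample_feats)
for _ in range(50):
    psi = psi + 0.05 * reward_gradient(psi, expert_feats, sample_feats)  # gradient ascent
after = log_likelihood(psi, expert_feats, sample_feats)
```

In the full procedure, the sampled trajectories would be regenerated by the refined policy between reward updates, rather than held fixed as they are here.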


Actually, we can view inverse RL from the perspective of a GAN. Our policy generates trajectories; this is the GAN generator. The reward function acts as a discriminator, which uses the reward measurement to distinguish between expert demonstrations and the trajectories from the policy.

In a GAN, we train both the discriminator and the generator in alternating steps so that the discriminator can detect the smallest difference while the generator generates actions that fool the smartest discriminator. With a GAN, we learn how to generate trajectories close to the experts’.

In fact, it can be mathematically proven that the GAN is equivalent to our previous approach if the objective function is defined as we just described.

Evolutionary methods

We say we want to learn as efficiently as humans. Maybe we should ask whether RL should focus on the superiority of its computational speed instead. Policy Gradient methods easily take 10M training iterations. At some point, we should ask how close this is to random guessing. The answer is not close, but can we close the gap by guessing smartly? For example, we can start with a random policy. We make many guesses and observe the collected rewards. We select the top performers (say, the top 20%) and mutate our guesses from them. We repeat the guessing and refinement. Hopefully, we can find the optimal policy through these smart guesses. These methods usually involve extremely simple computation, and we can parallelize our guesses easily. The simplicity and high parallelism make this approach appealing compared with other RL methods, in particular for synthetic graphics.
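The guess, select, and mutate loop can be sketched in a few lines. This is a toy illustration under our own assumptions: a policy is just a 3-number parameter vector, and `episode_reward` is a hypothetical stand-in for running an episode and collecting its reward.

```python
import numpy as np

def episode_reward(policy):
    # Hypothetical stand-in for an episode: reward peaks at a hidden target.
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((policy - target) ** 2)

def evolve(n_generations=50, population=50, elite_frac=0.2, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    guesses = rng.normal(0.0, 1.0, size=(population, 3))  # random initial policies
    n_elite = max(1, int(population * elite_frac))
    history = []
    for _ in range(n_generations):
        rewards = np.array([episode_reward(g) for g in guesses])
        history.append(rewards.max())
        # Keep the top performers (sorted ascending, so the best is last).
        elite = guesses[np.argsort(rewards)[-n_elite:]]
        # Refill the population with mutated copies of random elites.
        parents = elite[rng.integers(0, n_elite, size=population - n_elite)]
        children = parents + rng.normal(0.0, sigma, size=parents.shape)
        guesses = np.concatenate([elite, children])
    return elite[-1], history

best_policy, history = evolve()
```

Each reward evaluation is independent of the others, so the list comprehension over `guesses` is the part that could be spread across many workers.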

Reinforcement learning (Recap)

Here is a snapshot of what different reinforcement learning methods emphasize. We either ignore the other components or sample the results through simulations.

This is the same figure for imitation learning, inverse RL, and evolutionary methods.

As mentioned frequently, RL methods are not mutually exclusive. We often do mix and match. For example, Actor-critic merges Policy Gradient with Value-learning, and Guided Policy Search merges Model-based methods with Policy-learning.


