RL — PLATO: Policy Learning using Adaptive Trajectory Optimization


Imitation plays a major role in learning. In RL, imitating an expert cuts down the search for good solutions and makes training far more sample efficient. Imitation is also usually trained with supervised learning, which tends to be more stable than pure RL.

DAgger: Dataset Aggregation

DAgger is an algorithm that augments the training dataset with expert demonstrations for the states the learned policy actually visits. Even when the learned policy drifts away from the original demonstrations, the newly collected expert labels tell it how to take corrective actions.

DAgger algorithm (source)

In step 2, it collects the observations (states) that the currently trained policy visits. In step 3, it asks the expert what actions should have been taken in those states, and those labeled pairs are aggregated back into the dataset.
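To make the loop concrete, here is a minimal DAgger sketch in Python. The env, expert_action, and train_policy interfaces are placeholders for illustration, not part of any particular library.

```python
# Minimal DAgger sketch (env, expert_action, train_policy are assumed interfaces).
def dagger(env, expert_action, train_policy, n_iters=10, horizon=100):
    dataset = []   # aggregated (observation, expert_action) pairs
    policy = None  # learned policy; the expert is rolled out on the first pass

    for _ in range(n_iters):
        # Step 2: run the current policy to collect the states it actually visits.
        obs = env.reset()
        visited = []
        for _ in range(horizon):
            action = expert_action(obs) if policy is None else policy(obs)
            visited.append(obs)
            obs, done = env.step(action)
            if done:
                break

        # Step 3: ask the expert to relabel those states with corrective actions.
        dataset += [(o, expert_action(o)) for o in visited]

        # Steps 4 and 1: aggregate the data and retrain the policy on all of it.
        policy = train_policy(dataset)

    return policy
```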

Imitating MPC: PLATO algorithm

PLATO maintains two policies: a learned policy π_θ, as in DAgger, and an expert policy. The expert is a trajectory optimizer (MPC) whose controller takes the form of a time-varying linear-Gaussian controller:
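Written out, with state x_t, action u_t, and time-varying gains K_t, k_t and covariance Σ_t produced by the optimizer at each step, such a controller looks roughly like:

\hat{\pi}_t(u_t \mid x_t) = \mathcal{N}\big(K_t x_t + k_t,\; \Sigma_t\big)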

Unlike DAgger, step 2 does not explore the environment with the learned policy π_θ. Instead, it runs the expert policy, the trajectory-optimization (MPC) controller, to collect data.

In step 3, a computer replaces the human expert: the actions are chosen by re-planning with the following objective:
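Concretely, with c the trajectory cost, x_t the true state, o_t the observation seen by the learned policy π_θ, and λ a trade-off weight, the re-planned controller at time t looks roughly like:

\hat{\pi}_t = \arg\min_{\pi'} \; \mathbb{E}_{\pi'}\Big[\sum_{t'=t}^{T} c(x_{t'}, u_{t'})\Big] + \lambda \, D_{\mathrm{KL}}\Big(\pi'(u_t \mid x_t) \,\big\|\, \pi_\theta(u_t \mid o_t)\Big)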

The first term is the trajectory cost and the second term measures the divergence between the expert (re-planned) controller and the learned policy. We want the trajectory cost to be low while the expert's actions do not stray too far from what the learned policy would do.

In a nutshell, the divergence penalty encourages the expert to stay close to the learned policy's behavior unless doing so becomes too costly. In that case, the path is re-planned to reduce cost, but without making overly aggressive changes.

PLATO algorithm (source)

At every time step, we re-plan with this objective, execute the expert's action, and add the resulting observation-action labels to the training dataset used to train the learned policy.
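A rough sketch of that loop, with mpc_replan, train_policy, and the environment interface as illustrative placeholders (the full PLATO algorithm in the paper has more details), might look like this:

```python
# PLATO-style training loop sketch (all interfaces are illustrative placeholders).
def plato(env, mpc_replan, train_policy, n_iters=10, horizon=100, lam=1.0):
    dataset = []   # (observation, expert_action) pairs
    policy = None  # learned policy pi_theta(u | o)

    for _ in range(n_iters):
        state, obs = env.reset()
        for t in range(horizon):
            # Re-plan with the adaptive objective: trajectory cost plus
            # lam * KL(teacher || learned policy). The teacher sees the full
            # state; the learned policy only sees the observation.
            teacher = mpc_replan(state, policy, obs, lam)

            # Label the current observation with the teacher's action.
            dataset.append((obs, teacher.mean_action(state)))

            # Execute the teacher's (noisy) action in the environment.
            state, obs, done = env.step(teacher.sample_action(state))
            if done:
                break

        # Supervised learning on all labels collected so far.
        policy = train_policy(dataset)

    return policy
```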

Imitate expert control

In training, we may have access to an expert method for generating a policy that will not be available at deployment. For example, during training a helicopter might fly above a car to give it better state information for optimizing its trajectory. No such luxury exists once the car is deployed.

We can let the expert method compute the actions and use the trajectories it generates as labels to train another policy that relies only on the observations available at deployment.
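As a rough sketch (the trajectory format and helper names below are illustrative, not from the original post), the relabeling step pairs each deployment-time observation with the action the expert computed from the privileged training-time state:

```python
# Train the deployment policy on deployment-time observations only,
# using action labels the expert computed from privileged state.
def relabel_and_train(trajectories, expert_action, train_policy):
    dataset = []
    for trajectory in trajectories:
        for privileged_state, deployment_obs in trajectory:
            # The expert can use extra sensors / full state during training...
            u = expert_action(privileged_state)
            # ...but the learned policy only ever sees deployment sensors.
            dataset.append((deployment_obs, u))
    return train_policy(dataset)
```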

In the video below, the robot uses a laser to map out its surroundings during training. During real deployment, we cannot afford those extra sensors.
