RL — PLATO: Policy Learning using Adaptive Trajectory Optimization

Jonathan Hui
May 10, 2022

Imitation plays a major role in learning. In RL, it reduces the time spent searching for solutions and is sample efficient. Imitation is also usually trained with supervised learning, which is stable.

DAgger: Dataset Aggregation

DAgger is an algorithm that augments the training dataset with expert demonstrations for the states that the learned policy may visit. So even if the learned policy drifts away from the expert demonstrations, we can learn corrective actions from the newly provided expert labels.

(Figure: the DAgger algorithm)

In step 2, it collects observations (states) that the trained policy may visit. In step 3, it asks the expert what actions should be taken in those states.
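
To make the loop concrete, here is a minimal sketch of DAgger in Python. The callables train_supervised, run_policy, and query_expert are hypothetical placeholders for the supervised learner, the rollout collector, and the expert labeler; they are not part of any specific library.

```python
def dagger(initial_demos, num_iterations, train_supervised, run_policy, query_expert):
    """Minimal DAgger sketch (hypothetical helper callables, no specific library)."""
    dataset = list(initial_demos)          # step 1: start from expert (observation, action) pairs
    policy = train_supervised(dataset)     # fit the policy on the expert data
    for _ in range(num_iterations):
        observations = run_policy(policy)  # step 2: collect observations the policy visits
        labels = [query_expert(o) for o in observations]  # step 3: ask the expert to label them
        dataset += list(zip(observations, labels))        # step 4: aggregate the dataset
        policy = train_supervised(dataset) # retrain on the aggregated data
    return policy
```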

Imitating MPC: PLATO algorithm

PLATO has two policies: the learned policy π_θ, as in DAgger, and an expert policy. The expert policy is based on a trajectory optimization model and takes the form of a linear Gaussian controller.
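
In the standard notation for such controllers (the symbols below are the conventional ones, assumed here rather than taken from the original figure), the controller can be written as:

```latex
% Time-varying linear Gaussian controller (standard form, assumed notation):
% K_t is the feedback gain, k_t the open-loop term, and Sigma_t the action covariance.
\[
  \pi_t(\mathbf{u}_t \mid \mathbf{x}_t)
    = \mathcal{N}\!\left(\mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t,\; \boldsymbol{\Sigma}_t\right)
\]
```

Here x_t is the full state, u_t is the control, K_t is the feedback gain, k_t is the open-loop term, and Σ_t is the action covariance.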

Unlike DAgger, step 2 does not explore the space with the learned policy π_θ. Instead, it explores with the expert policy, i.e., the trajectory optimization model.

In step 3, it replaces the human with the computer: actions are determined by minimizing an objective that trades off the trajectory cost against the divergence from the learned policy.
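
Roughly, with c denoting the per-step trajectory cost, λ the trade-off weight, and D_KL the KL divergence (the exact notation is an assumption here), the objective can be written as:

```latex
% Objective for the replanned expert (assumed notation): expected trajectory
% cost plus a penalty on the divergence from the learned policy pi_theta,
% which only sees the onboard observation o_t.
\[
  \hat{\pi}_t \;=\; \arg\min_{\pi}\;
    \mathbb{E}_{\pi}\!\left[\sum_{t'=t}^{T} c(\mathbf{x}_{t'}, \mathbf{u}_{t'})\right]
    \;+\; \lambda\, D_{\mathrm{KL}}\!\left(\pi(\mathbf{u}_t \mid \mathbf{x}_t)
      \,\middle\|\, \pi_\theta(\mathbf{u}_t \mid \mathbf{o}_t)\right)
\]
```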

The first term is the trajectory cost, and the second term measures the divergence between the expert and the learned policy. So we want the trajectory cost to be small while not diverging too far from the learned policy.

In a nutshell, we impose a divergence penalty that encourages us to stay on the learned path unless doing so costs more than we are willing to pay. In that case, the path is replanned to reduce the cost without making overly aggressive changes.

At every time step, we can use this objective to replan, and we add the resulting trajectories to the training dataset used to train the learned policy.
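
As a rough sketch of this loop (assuming a hypothetical env that returns both the full state and the onboard observation, and placeholder callables replan and train_supervised, none of which are APIs from the PLATO paper):

```python
def collect_and_train(env, policy, dataset, horizon, lam, replan, train_supervised):
    """One PLATO-style data collection pass; all callables are hypothetical placeholders."""
    x, o = env.reset()                     # full state x, onboard observation o
    for _ in range(horizon):
        u = replan(x, policy, lam)         # minimize trajectory cost + lam * divergence term
        dataset.append((o, u))             # label the observation with the replanned action
        x, o = env.step(u)                 # execute the replanned (expert) action
    policy = train_supervised(dataset)     # retrain the learned policy on the aggregated data
    return policy, dataset
```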

Imitating expert control

During training, we may have an expert method available to create a policy, but such a method may not be available at deployment. For example, during training a helicopter can fly over a car to observe it and optimize its trajectory better. No such luxury is available at deployment.

We can optimize actions with the expert method and use the labels it generates (the trajectories) to train another policy that relies on a different observation.
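
A small sketch of this idea, assuming hypothetical callables expert_action (which uses the privileged training-time state) and deployment_observation (which produces what the robot will actually see at test time):

```python
def build_deployment_dataset(logged_states, expert_action, deployment_observation):
    """Pair deployment-time observations with expert actions computed from privileged state."""
    dataset = []
    for x in logged_states:
        o = deployment_observation(x)   # what the robot will see at deployment
        u = expert_action(x)            # label from the expert using the full training-time state
        dataset.append((o, u))          # supervised pair for the deployment policy
    return dataset
```

The deployment policy is then trained with supervised learning to map these observations to the expert's actions, just as in DAgger.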

In the demo video, the moving agent uses a laser sensor to map out its surroundings during training. But during real deployment, we cannot afford those extra sensors.
