# Motivation of GPS

We take the tasks shown in the video for granted. Supervised learning, policy gradients, and imitation learning all depend heavily on sampling to train a deep network, which takes a daunting amount of time. What challenges do we face in reinforcement learning (RL) for robotic control?

• Robotic control uses a camera to observe the external environment. Inferring actions from high-dimensional, entangled data is hard.
• Executing a partially learned but potentially bad policy puts the robot in harm's way.
• Initial guesses for the policy may be ill-conditioned.

# Guided Policy Search (GPS)

Model-based RL plans optimal actions based on the cost function and the dynamics model f.

Dual gradient descent (DGD) optimizes an objective under a constraint C:
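As a sketch, assuming the standard GPS notation (trajectory τ, states x_t, controller actions u_t, policy π_θ), the constrained problem and its Lagrangian can be written as

$$
\min_{\tau, \theta} \; c(\tau) \quad \text{s.t.} \quad u_t = \pi_\theta(x_t) \;\; \forall t
$$

$$
\mathcal{L}(\tau, \theta, \lambda) = c(\tau) + \sum_t \lambda_t \big( \pi_\theta(x_t) - u_t \big)
$$

DGD then alternates three steps: minimize the Lagrangian over τ (trajectory optimization), minimize it over θ (supervised learning), and take a gradient ascent step on λ.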

# Deterministic GPS

Let's walk through an example with a deterministic policy. First, we are going to simplify the notation a little. We write the objective as:
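As a sketch, assuming the usual notation with trajectory τ = (x₁, u₁, …, x_T, u_T) and dynamics model f, the simplified objective reads

$$
\min_{\tau} \; c(\tau) = \min_{u_1, \ldots, u_T} \sum_{t=1}^{T} c(x_t, u_t) \quad \text{s.t.} \quad x_{t+1} = f(x_t, u_t)
$$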

# Intuition

A controller determines what actions to take. We run the controller on a robot and observe the trajectories, then use them as expert demonstrations to train a policy. The policy serves the same objective as the controller, but it is modeled with a deep network, while the controller uses a trajectory optimization method to determine actions.
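The loop above can be sketched in a few lines. This is a hypothetical toy setup: a linear controller on toy linear dynamics supplies the demonstrations, and least squares stands in for training a deep network.

```python
# Toy sketch of the GPS intuition: run a controller, collect trajectories,
# then fit a policy to them as supervised "expert demonstrations".
import numpy as np

rng = np.random.default_rng(0)

def run_controller(K, x0, T=20):
    """Roll out a linear controller u = K x on toy linear dynamics."""
    xs, us = [], []
    x = x0
    for _ in range(T):
        u = K @ x                      # controller picks the action
        xs.append(x)
        us.append(u)
        x = 0.9 * x + 0.1 * u          # toy dynamics (hypothetical)
    return np.array(xs), np.array(us)

K = np.array([[-0.5, 0.1], [0.0, -0.4]])    # hand-picked controller gains
xs, us = run_controller(K, rng.normal(size=2))

# Supervised step: fit policy parameters W so that x @ W ≈ u,
# treating the controller's trajectory as expert demonstrations.
W, *_ = np.linalg.lstsq(xs, us, rcond=None)

def policy(x):
    """The trained policy: a stand-in for a deep network."""
    return x @ W

err = np.max(np.abs(policy(xs) - us))
```

Since the demonstrations here come from an exactly linear controller, the fitted policy recovers it; with a real robot the policy is a neural network and the fit is approximate.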

# Imitate optimal control

To plan trajectories accurately, we need to measure all the relevant states precisely. During training, we can deploy extra sensors and set up a controlled environment to collect those states, in particular for the external environment. Under this setup, it is not hard to plan the optimal controls. But how can we create a solution that does not require such heavy lifting at deployment? For example, we can train a self-driving car with Lidar and video cameras. But can we skip the Lidar in deployment?
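To make this concrete, here is a minimal toy sketch (hypothetical names and dynamics, with least squares standing in for a deep network): the controller plans from the full, instrumented state x, while the policy we deploy only ever sees a partial observation o.

```python
# Sketch: train with privileged sensors, deploy without them.
# The controller reads the full state x (e.g. from Lidar plus extra
# sensors), while the deployed policy only sees a partial observation o
# (e.g. from a camera). All quantities here are hypothetical toys.
import numpy as np

rng = np.random.default_rng(1)

T = 200
xs = rng.normal(size=(T, 3))           # full state, training time only
obs = xs @ np.array([[1.0, 0.0],       # observation: partial view of x
                     [0.0, 1.0],
                     [0.5, 0.5]])

def controller(x):
    """Privileged controller: plans from the full state."""
    return -x.sum(axis=-1, keepdims=True)

us = controller(xs)                    # target actions from privileged info

# Train the deployable policy on observations only (least squares here,
# standing in for a deep network).
W, *_ = np.linalg.lstsq(obs, us, rcond=None)

def policy(o):
    """Deployed policy: needs only the camera observation."""
    return o @ W
```

At deployment time only `policy` is needed; the privileged sensors were just scaffolding to generate good supervision.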

# Stochastic GPS

Previously, both the controller and the policy were deterministic. How can we use GPS with a stochastic controller and policy? It turns out this requires only small changes.
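As a sketch, assuming a stochastic controller p(u_t | x_t) and a stochastic policy π_θ(u_t | x_t), the constraint changes from matching actions to matching action distributions:

$$
\min_{p, \theta} \; E_{p(\tau)}\big[ c(\tau) \big] \quad \text{s.t.} \quad \pi_\theta(u_t \mid x_t) = p(u_t \mid x_t) \;\; \forall t
$$

In practice, the mismatch between the two distributions is typically measured with a KL divergence.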

# Augmented Lagrangian

Next, we will look at a few methods to improve GPS. We add an additional quadratic penalty on the divergence between the controller and the policy. The new Lagrangian becomes:
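As a sketch in the deterministic notation from before, with ρ as the penalty weight, the augmented Lagrangian adds a quadratic term to the linear one:

$$
\bar{\mathcal{L}}(\tau, \theta, \lambda) = c(\tau) + \sum_t \lambda_t \big( \pi_\theta(x_t) - u_t \big) + \rho \sum_t \big\| \pi_\theta(x_t) - u_t \big\|^2
$$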

# Multiple trajectories

We want to optimize multiple trajectories in parallel instead of a single trajectory. The benefit is similar to that of a mini-batch.
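As a sketch, with N trajectories τ₁, …, τ_N sharing one policy π_θ, the objective becomes

$$
\min_{\theta, \tau_1, \ldots, \tau_N} \; \sum_{i=1}^{N} c(\tau_i) \quad \text{s.t.} \quad u_{t,i} = \pi_\theta(x_{t,i}) \;\; \forall t, i
$$

Each trajectory can be optimized independently; only the supervised policy update couples them, much like averaging gradients over a mini-batch.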

# Note

This article was written a few years ago and I have not reviewed it lately. Please leave a message in the comment section for any needed updates.
