RL — Natural Policy Gradient Explained
Policy Gradient methods PG are popular in reinforcement learning RL. PG increases the chance of taking actions that have good rewards (or vice versa). But it faces a few challenges including bad sample efficiency and poor convergence.
- Sample efficiency measures the amount of samples needed to improve a policy. For simulation in the real world, sampling can be expensive.
- Convergence is about whether the policy will converge to an optimal policy.
In this article, we introduce the natural policy gradient which converges the model parameters better. It avoids taking bad actions that collapse the training performance.
Natural Policy Gradient
Natural Policy Gradient is based on Minorize-Maximization algorithm (MM) which optimizes a policy for the maximum discounted rewards. In the figure below:
- θ parameterizes our policy model,
- η is the discounted reward function and
- θ* is the optimal policy we want to find.
First, we start with a random or educated policy as the current policy.
- (Left figure above) We establish a lower bound M, shown as blue, for the discounted reward function η and it approximates η locally at the current policy (the black dot).
- (Middle figure) Then, we optimize M and uses the optimized policy as the current policy.
- (Right figure) We re-establish a new lower bound M at the current policy and optimize it again.
These iterations continue as we optimize a lower bound function every time until the current policy converges.
We define the optimization of the lower bound function as a trust region optimization.
i.e. we are only interested in finding an optimal point for L within a trust region (for Δθ within δ). Points outside the trust region may have higher calculated rewards but we will ignore them because we cannot trust the accuracy of the improvement.
Let’s detail the objective more:
where L is defined as:
Let’s break down the equation and understand it one piece at a time.
A(s, a) is the advantage function which is the expected rewards minus a baseline. As long as the baseline is not depending on our model parameters, optimizing A is the same as optimizing the expected rewards w.r.t. the model θ.
We often use the value function V(s) as the baseline to reduce the variance of the calculation. Therefore, A can be viewed as measuring how good an action is relative to the average action.
We formulate the reward calculation using importance sampling. i.e. the total rewards of a policy can be estimated by a ratio based on another policy.
This reformulation is not very important for our discussion now. The key point is L reflects the total expected rewards for the optimization purpose.
We want to optimize L within a trust region, i.e. the difference (Δθ) between the current policy and the new policy should be within a distance δ.
Since the policy space is not flat, the trust region distance is measured with a metric tensor F (a matrix) instead of the Euclidean distance (a square distance). The geometry we learn in high school assumes the earth is flat! It will take some time to get used to the new way of measuring distance.
For flat space, F is just an identity matrix:
In general, F measures distance along a curved surface (or a flat one). We will go into the detail calculation of F a little bit later.
Trust region
Let’s dig deeper on why we need this trust region in optimizing an objective? Intuitively, L approximates the expected rewards but its accuracy decreases as the current policy and the new policy diverge. Theoretically, we can calculate a trust region which the optimal point within this region will always perform better than the current policy. However, if it is outside this region, the bet is off. This guarantee is one of the key achievement of natural policy gradient. Now, we have some certainty that we do not make bad moves. All moves will improve the policy and if given enough time, we can converge to at least a local optimal point.
Sound magic but the TRPO paper proves it mathematically that such region exists and can be calculated (with some reasonable assumptions). TRPO proves that this new policy will be guaranteed better than the current policy until it reaches a local optimum. In practice, the computed δ is too conservative, and therefore, we treat it as a hyperparameter so we can tune later.
The optimal policy find in the trust region guarantees perform better than the current policy unless we reach a local optimum.
Optimization solution
In the TRPO paper, the constraint in our optimization problem
can be further elaborated using the mean of KL-divergence.
where KL-divergence is defined as:
KL-divergence measures the difference of two policies. As shown later, it remains closely related to the metric tensor F.
To approximate the objective, we use Taylor’s series to expand the objective 𝓛 and the expected KL-divergency constraint up to the second-order. The second-order term of 𝓛 is much smaller than the KL-divergence term and will be ignored.
After taking out the zero values:
where g is the policy gradient (the gradient of rewards — same as the Policy Gradient methods) and F is the metric tensor mentioned before which is computed as the second-order derivative of the expected KL-divergence between the current and the new policy. F measures the sensitivity (curvature) of the policy relative to the model parameter θ.
After all the maths, our objective can be approximated as:
Intuitively, the reward difference between two policies is approximated by the policy gradient g. For it to be accurate enough to guarantee a better policy, the distance between the new policy and the current policy cannot be different more than δ.
This objective can be solved analytically:
We rewrite the equation as:
and β depends on δ, F, and g.
The term
is the natural policy gradient.
This policy gradient brings us to the next policy which guarantee to be better than the current policy. Using it as the inner loop optimization for the MM algorithm, we can eventually find the optimal policy!
Note F is a Hessian matrix and therefore often written as H also. Next, let us look into F more.
Fisher Information matrix FIM
F is the Fisher Information matrix, a second-order derivative of the KL-divergence.
F acts as a metric tensor (a distance function) measuring the distance between two points in the policy space. This is the same tensor matrix we mentioned in the trust region.
What is the advantage of this natural policy gradient over the policy gradient g?
We want to limit the policy change but how can we translate it to the policy model parameters (note: parameter space ≠ policy space). To turn a car by 5°, how much should we turn the steering wheel? Different car models will be different. So how can we limit policy change for any model designs?
Let’s take a break on some basic concepts first. We have many ways to represent a point including the use of the Cartesian coordinates (x, y), or the polar coordinates (r, θ). The following maps values from the polar coordinates to the cartesian coordinates.
where A is a matrix transforming one coordinate system (polar) to another (Cartesian). It verifies whether two points represented in two different coordinates are the same. For example, by transforming (1.414, 45°) using the matrix A, we can prove that it is the same as (1.0, 1.0) in the Cartesian coordinate. In policy gradient methods, we parameterize the policy model with θ. These parameters form a “coordinate system” in representing the policy.
The components of vectors are different using different coordinate systems. But we can transform from one coordinate system to another using the transformation matrix A and determine whether they are the same.
What is this Fisher Information Matrix FIM? In the Euclidian space, where space is flat, the distance between two points can be calculated as:
These equations can be generalized to measure distance including non-flat space:
where G is the metric tensor and Δw is the difference of two vectors. The metric tensor for the Cartesian and polar coordinates are:
There is another way to calculate G by computing how distance d changes relative to its parameters (x, y), or (r, θ). This is more complex and less obvious but it can be generalized for any coordinate system. The Fisher Information Matrix F is an example of this method.
F is an approximation of the metric tensor G in the policy space. It measures the curvature (sensitivity) of the policy relative to the model parameter θ. It calculates the policy difference of two points represented in the parameter space. In reverse, the inverse matrix of F maps the difference in the policy space to the parameter space. Therefore,
The Fisher Information Matrix maps between the parameter space and the policy space.
Now, back to the question why the natural policy gradient is better than policy gradient. Gradient descent is very popular in deep learning because it has a very low computational complexity. But it is not necessarily very accurate. Once we determine the deepest slope, we use the learning rate to determine how far to go. If the step is too small, we learn too slow. If it is too big, the new policy may take actions that lead us off the cliff (the left photo below). This is particularly bad in RL than deep learning. Bad actions lead to poorly performed states. As we resume the exploration, we start with a bad location with a locally bad policy. It destroys our training progress and hurts convergency badly.
To prevent this, we apply the trust region method which guarantees a better policy. Just like the middle photo above, the new actions will never take us off the cliff. However, how can we translate the policy change into model parameters? How can we avoid different model parameterization resulted in different actions (right figure above)?
The gradient ascent’s parameter updates depend on the learning rate which is unrelated to the policy-model mapping. Therefore, it produces different policy changes for different parameterizations. As the parameterization of our model is a relatively arbitrary choice, we need an optimization method to be model invariant. i.e. the solution works consistently with any models.
Natural policy gradient produces the same policy change regardless of the model parameterization.
If matrix A maps the parameters between a model v to another model w (δv → δv) with:
Even, the calculated natural policy gradients are different for both models,
They should represent the same entity, i.e.
The proof here shows that is true for the natural policy gradient (it is not true for the regular policy gradient). The transformation matrix A is the derivative of one coordinate system relative to the other as underlined as red below:
The proof above proves that the natural policy gradient is covariance. In technical term, it is basis-independent or coordinate system independent. The component values are different but after the transformation, it shows that they represent the same entity. If we replace F with a metric measuring the Euclidean distance, like the regular policy gradient, the gradients will not be covariance. In those cases, parameter changes lead to different policy changes in different models.
The natural policy gradient is independent of the model parameterization θ.
More thoughts
Natural Policy Gradient improves the training by having better guarantee of our policy changes. But Natural Policy Gradient is a second-order optimization method which is very expensive. Next, we check out how ACKTR and PPO can apply the natural policy gradient concept with complexity closer to first-order optimization method, like gradient descent.