Machine Learning — MAP inference with Linear Programming & Dual decomposition
Previously, we discussed max-product inference with the Viterbi algorithm to perform exact MAP (maximum a posteriori) inference.
The Viterbi algorithm applies the concept of message passing to locate the maximum of the posterior. However, such exact solutions can still be too complex for many real problems. In this article, we switch gears and look at approximate inference. We can reduce MAP inference to an optimization problem using linear programming. Even though optimization problems are generally NP-hard, solving a linear program is efficient, heavily studied, and supported by many software packages. In the second part of the article, we look into another method that deploys the divide-and-conquer concept: we approximate the solution by decomposing the problem into smaller subproblems using dual decomposition.
MAP inference with Linear Programming
Recall that a Markov random field (MRF) can be formulated as:
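p(x) = (1/Z) ∏ᶠ φᶠ(xᶠ) = (1/Z) exp( Σᶠ θᶠ(xᶠ) )

where the product and the sum run over the factors (cliques) f, θᶠ = log φᶠ is the log-potential, and Z is the partition function.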
So, for our MAP inference, we can rewrite our objective as maximizing θ(x). If we model the graphical model with unary and pairwise factors only, θ(x) is composed of a score for each node i and each edge ij. Our objective is
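max θ(x) = max [ Σ θᵢ(xᵢ) + Σ θᵢⱼ(xᵢ, xⱼ) ]

where the maximization is over x, the first sum runs over the nodes i, and the second over the edges ij.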
Our next step is to convert this into a linear program of the form
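max θᵀμ  (maximizing over μ)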
where θ will be a vector. For simplicity, we assume xᵢ is binary, xᵢ ∈ {0, 1}. Consider a graphical model with x₁ = 0 and x₂ = 1. x is first transformed into μ using the one-hot-vector concept: the upper part of μ contains the one-hot vector of each node, and the bottom part identifies the corresponding edge entries.
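For x₁ = 0 and x₂ = 1:

μ = ( μ₁(0), μ₁(1), μ₂(0), μ₂(1), μ₁₂(0,0), μ₁₂(0,1), μ₁₂(1,0), μ₁₂(1,1) )ᵀ
  = ( 1, 0, 0, 1, 0, 1, 0, 0 )ᵀ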
θ will be filled with the node and edge potentials for every combination of xᵢ. By multiplying θ with μ, we compute the unnormalized score. Below, we generalize the idea to any graphical model.
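θᵀμ = Σ θᵢ(xᵢ) μᵢ(xᵢ) + Σ θᵢⱼ(xᵢ, xⱼ) μᵢⱼ(xᵢ, xⱼ)

where the sums run over all nodes i, all edges ij, and all of their possible values.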
After we transform x into μ, the score for the configuration x equals θᵀμ. Our original objective is equivalent to:
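maxₓ θ(x) = max θᵀμ,  maximizing the right-hand side over all valid μ.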
However, μ cannot take arbitrary values. Its entries have to be integers satisfying the following constraints:
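Σ μᵢ(xᵢ) = 1 for every node i (summing over the values xᵢ)
Σ μᵢⱼ(xᵢ, xⱼ) = 1 for every edge ij (summing over the value pairs xᵢ, xⱼ)
μᵢ(xᵢ) ∈ {0, 1} and μᵢⱼ(xᵢ, xⱼ) ∈ {0, 1}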
Because of the one-hot-vector concept, only one element of μᵢ is one, i.e. a node can take only one value. The same holds for the bottom part of the vector, which indicates the configuration of the node pair in an edge. The constraint μᵢ(xᵢ) = Σ μᵢⱼ(xᵢ, xⱼ) ensures the information in μᵢ is consistent with the corresponding information in μᵢⱼ. Here is the optimization problem with all the constraints:
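maximize θᵀμ over μ, subject to
μᵢ(xᵢ) ∈ {0, 1},  μᵢⱼ(xᵢ, xⱼ) ∈ {0, 1}
Σ μᵢ(xᵢ) = 1 for every node i
Σ μᵢⱼ(xᵢ, xⱼ) = 1 for every edge ij
μᵢ(xᵢ) = Σ μᵢⱼ(xᵢ, xⱼ)  (summing over xⱼ)
μⱼ(xⱼ) = Σ μᵢⱼ(xᵢ, xⱼ)  (summing over xᵢ)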
The last two constraints ensure the edge variables are consistent with the node variables. We call this globally consistent.
The entries in μ are all integers, so this is an integer linear program (ILP). An ILP is still NP-hard (we will touch on the difficulty later), but we can relax it into a much easier linear program (LP).
Instead of requiring the entries of μ to be integers, we relax the constraints so that each entry can lie anywhere in the range [0, 1]. We solve the corresponding LP and round the LP solution to the nearest integers at the end.
Since the LP maximizes over a larger space, the optimal LP value will be greater than or equal to the MAP value, i.e. MAP(θ) ≤ LP(θ). The LP is tight if MAP(θ) = LP(θ). In practice, the LP solution works reasonably well in many cases, and software packages are readily available for solving it.
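As a concrete illustration, here is a minimal sketch of the LP relaxation for the two-node example above, using scipy.optimize.linprog. The potential values in theta are made up purely for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Variables: [mu1(0), mu1(1), mu2(0), mu2(1),
#             mu12(0,0), mu12(0,1), mu12(1,0), mu12(1,1)]
# Node and edge potentials (arbitrary illustrative values).
theta = np.array([0.0, 1.0,              # theta_1(x1)
                  0.5, 0.2,              # theta_2(x2)
                  1.0, 0.0, 0.0, 1.0])   # theta_12(x1, x2)

# Equality constraints A_eq @ mu = b_eq:
# node normalization plus edge-node consistency.
A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],    # mu1(0) + mu1(1) = 1
    [0, 0, 1, 1, 0, 0, 0, 0],    # mu2(0) + mu2(1) = 1
    [-1, 0, 0, 0, 1, 1, 0, 0],   # mu12(0,0) + mu12(0,1) = mu1(0)
    [0, -1, 0, 0, 0, 0, 1, 1],   # mu12(1,0) + mu12(1,1) = mu1(1)
    [0, 0, -1, 0, 1, 0, 1, 0],   # mu12(0,0) + mu12(1,0) = mu2(0)
    [0, 0, 0, -1, 0, 1, 0, 1],   # mu12(0,1) + mu12(1,1) = mu2(1)
])
b_eq = np.array([1, 1, 0, 0, 0, 0])

# linprog minimizes, so negate theta to maximize theta @ mu.
# Integrality is relaxed: each entry may lie anywhere in [0, 1].
res = linprog(-theta, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 8, method="highs")
mu = res.x
print("LP solution:", np.round(mu, 3))
print("Rounded node assignments:", np.round(mu[:4]))
```

For a single edge the graph is a tree, so the LP is tight here and the solution comes out integral; in loopy graphs the rounding step matters.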
Dual decomposition
Now, let’s look into another approach called dual decomposition. In a graphical model, we can either maximize the score or minimize the energy for x. Conceptually, they are the same, just with opposite signs. In this section, let’s minimize the energy.
A problem is decomposable if
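f(x, y) = f₁(x) + f₂(y)  ⟹  min f(x, y) = min f₁(x) + min f₂(y)

with the left-hand minimization over (x, y) jointly and the right-hand minimizations over x and y separately.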
i.e. the original optimization problem can be divided into smaller, independent problems and solved separately.
(Credit: some of the equations come from or are modified from here.)
Many problems cannot be decomposed into two smaller problems with non-overlapping input parameters. But we can introduce a complicating variable z below and optimize the solution in alternating steps.
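min f₁(x, z) + f₂(y, z)  (minimizing over x, y, and z)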
In the first step, we fix z and optimize x and y in f₁ and f₂ independently. The optimized x* and y* are therefore functions of z.
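x*(z) = argmin f₁(x, z)  (over x)
y*(z) = argmin f₂(y, z)  (over y)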
In the second step, we compute the gradient for z and use gradient descent to update z.
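z ← z − α ∇ [ f₁(x*(z), z) + f₂(y*(z), z) ]

where the gradient is taken with respect to z and α is the step size.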
We repeat the iterations above to find the optimal solution. This method is called primal decomposition.
In MAP inference, the input parameters are discrete. Instead of applying gradient descent in the second step, we enumerate z to find the lowest objective value for f₁(x*, z) + f₂(y*, z). This is expensive, and it is part of why the ILP remains complex. But if we have trouble solving the primal problem, let’s try our luck with its corresponding dual problem. For those not familiar with Lagrangians, please review the material here first. First, in place of z, we formulate the primal problem with z₁ and z₂ and add a constraint that forces them to be the same in our solution.
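min f₁(x, z₁) + f₂(y, z₂)  subject to  z₁ = z₂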
The Lagrangian and the dual function will be
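L(x, y, z₁, z₂, λ) = f₁(x, z₁) + f₂(y, z₂) + λ (z₁ − z₂)
g(λ) = min L(x, y, z₁, z₂, λ)  (minimizing over x, y, z₁, and z₂)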
We will decompose the dual function as:
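g(λ) = g₁(λ) + g₂(λ),  where
g₁(λ) = min [ f₁(x, z₁) + λ z₁ ]  (over x and z₁)
g₂(λ) = min [ f₂(y, z₂) − λ z₂ ]  (over y and z₂)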
If p* and d* are the optimal values of the primal and the decomposed dual problem respectively, then d* ≤ p*: the dual function is a lower bound on p*.
Our decomposed dual problem is
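max g(λ) = max [ g₁(λ) + g₂(λ) ]  (maximizing over λ)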
We can use an iterative method to solve it. The first step optimizes g₁ and g₂ below.
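(x*, z₁*) = argmin [ f₁(x, z₁) + λ z₁ ]
(y*, z₂*) = argmin [ f₂(y, z₂) − λ z₂ ]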
The second step updates λ with a gradient step; since we maximize the dual, we ascend along its gradient, which is computed as:
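∇g(λ) = z₁*(λ) − z₂*(λ),  so the update is  λ ← λ + α (z₁* − z₂*)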
Intuitively, we move toward a solution in which z₁ and z₂ agree with each other. The algorithm for iterative dual decomposition is sketched below.
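Here is a minimal runnable sketch on a toy problem with f₁(z) = (z − 1)² and f₂(z) = (z − 3)², functions chosen only for illustration so that the inner minimizations have closed forms.

```python
# Toy dual decomposition: minimize f1(z) + f2(z) with
# f1(z) = (z - 1)^2, f2(z) = (z - 3)^2 (true optimum: z = 2).
# We split z into copies z1, z2 with the constraint z1 = z2.

alpha = 0.1   # step size for the dual (sub)gradient ascent
lam = 0.0     # Lagrange multiplier for the constraint z1 = z2

for step in range(200):
    # Step 1: minimize each subproblem independently (closed form):
    # argmin_z1 (z1 - 1)^2 + lam * z1  ->  z1 = 1 - lam / 2
    # argmin_z2 (z2 - 3)^2 - lam * z2  ->  z2 = 3 + lam / 2
    z1 = 1 - lam / 2
    z2 = 3 + lam / 2

    # Step 2: gradient ascent on the dual g(lam) = g1(lam) + g2(lam),
    # whose gradient is z1 - z2.
    lam += alpha * (z1 - z2)

print(lam, z1, z2)  # lam -> -2, while z1 and z2 -> 2 (agreement)
```

Each iteration solves the two subproblems separately and nudges λ until the two copies of z agree, exactly the two-step scheme described above.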
MAP inference with dual decomposition
Let’s come back to the MRF. xᶠ plays the role of z in our decomposition problem: xᶠ contains the nodes in the clique f, and its component xᶠᵢ is just xᵢ. Therefore, we add a constraint stating that xᵢ in the clique f is simply xᵢ in the graph.
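In one standard formulation, keeping the node energies and the clique energies separate:

min Σ θᵢ(xᵢ) + Σ θᶠ(xᶠ)  subject to  xᶠᵢ = xᵢ for every clique f and every node i in f

where the minimization is over x and the copies xᶠ, the first sum runs over the nodes, and the second over the cliques.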
The Lagrangian is (introducing one multiplier δᶠᵢ(x̂ᵢ) per clique-node-value triple, a common convention):
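L(x, {xᶠ}, δ) = Σ θᵢ(xᵢ) + Σ θᶠ(xᶠ) + Σ δᶠᵢ(x̂ᵢ) ( 1[xᶠᵢ = x̂ᵢ] − 1[xᵢ = x̂ᵢ] )

where the last sum runs over cliques f, nodes i in f, and values x̂ᵢ, and 1[·] is the indicator function.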
And we will use the following notation from now on.
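θ̄ᵢ(xᵢ) = θᵢ(xᵢ) − Σ δᶠᵢ(xᵢ)  (summing over the cliques f that contain i)
θ̄ᶠ(xᶠ) = θᶠ(xᶠ) + Σ δᶠᵢ(xᶠᵢ)  (summing over the nodes i in f)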
The dual function is
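g(δ) = Σ min θ̄ᵢ(xᵢ) + Σ min θ̄ᶠ(xᶠ)

where each θ̄ᵢ is minimized over xᵢ and each θ̄ᶠ over xᶠ; the first sum runs over the nodes and the second over the cliques, so each term can be minimized independently.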
Now, we can put everything together; this is the dual decomposition algorithm for an MRF.
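A sketch of the iterations, using the notation above (the update follows the standard subgradient form, with step size α):

1. Initialize all δᶠᵢ(x̂ᵢ) to 0.
2. For each node i, compute xᵢ* = argmin θ̄ᵢ(xᵢ); for each clique f, compute xᶠ* = argmin θ̄ᶠ(xᶠ).
3. For every clique f and node i in f where xᶠᵢ* ≠ xᵢ*, update δᶠᵢ(xᶠᵢ*) ← δᶠᵢ(xᶠᵢ*) + α and δᶠᵢ(xᵢ*) ← δᶠᵢ(xᵢ*) − α.
4. Repeat steps 2–3 until the node and clique solutions agree or an iteration limit is reached.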
Intuitively, we break the energy function down into subproblems containing the energy related to the nodes and the energy related to the edges separately. We optimize them independently, but the solutions will not agree. So we add an equality constraint to encourage their solutions to agree. To move the constraint into the objective function, we apply Lagrange multipliers. As mentioned before, the optimal value d* of the decomposed dual problem is a lower bound on the primal optimum p*. Therefore, we should view this as an approximation strategy rather than a way of finding the exact solution.
Credit & reference
Probabilistic Graphical Models