GANs are vulnerable to mode collapse and training instability. In this article, we look into how Spectral Normalization resolves these problems. To understand Spectral Normalization, however, we first need to review WGAN. The discussion involves mathematical concepts that some readers may not be familiar with, so we will introduce the terms first and make them approachable as we go.
In theory, when the generated distribution in GAN and the real distribution are disjoint, the gradient of the cost function consists of regions of vanishing and exploding gradients (the red line below).
WGAN claims that the Wasserstein distance is a better cost function since it has a smoother gradient (the blue line) for better learning. In practice, with proper tuning, vanilla GAN can generate high-quality images, but mode collapse remains a prominent issue. And the real challenge for the Wasserstein distance is how to compute it efficiently. One possibility is applying a Lipschitz constraint. WGAN enforces the 1-Lipschitz constraint with simple weight clipping. WGAN-GP enforces the constraint with a more complex scheme that corrects some of the problems observed in WGAN (details on WGAN & WGAN-GP can be found here).
Spectral Normalization is a weight normalization that stabilizes the training of the discriminator. It controls the Lipschitz constant of the discriminator to mitigate the exploding gradient problem and the mode collapse problem. This approach is computationally light compared with WGAN-GP and achieves good mode coverage.
Earth-Mover (EM) Distance / Wasserstein Metric
Let’s have a quick review of the Wasserstein distance. First, we want to move 6 boxes from the left below to the locations on the right marked by the dotted square. For box #1, let’s move it from location 1 to location 7. The moving cost is defined as the travel distance. So to move box #1 to location 7, the cost equals 6.
However, there is more than one way to move these boxes. Let's denote moving plan i as γᵢ. For example, in the first plan γ₁ below, we move 2 boxes from location 1 to location 10. Below, we show two different plans that both have a moving cost of 42.
However, not all transport plans bear the same cost. The Wasserstein distance (or the EM distance) is the cost of the cheapest transport plan. In the example below, the Wasserstein distance (minimum cost) is two.
The formalization of the Wasserstein distance can be extended from the discrete space to the continuous space easily as:
Let’s come back to our example again. Each table cell (x, y) below corresponds to the number of boxes moving from position x to y. After normalizing the cell value to a probability value, the moving cost for γ₁ is computed as:
Wasserstein distance finds the distance for the cheapest plan.
For any valid plan γ, its marginals must equal p(x) and p(y) respectively.
In the continuous space,
∏ is the set of valid plans whose marginals equal p(x) and p(y) respectively. Each plan γ in ∏ models a joint probability distribution over x and y. We want to find the plan in ∏ with the lowest cost.
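For discrete distributions, this is exactly a linear program: minimize the transport cost subject to the marginal constraints. As a minimal sketch (using hypothetical distributions, not the exact box example from the figures), we can solve it with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical discrete distributions over positions 0..3:
# all mass starts at positions 0 and 1, and must end at positions 2 and 3.
p = np.array([0.5, 0.5, 0.0, 0.0])   # source marginal p(x)
q = np.array([0.0, 0.0, 0.5, 0.5])   # target marginal p(y)
n = len(p)

# Cost of moving mass from position i to position j: |i - j|.
C = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)

# Flatten the transport plan gamma (an n x n matrix) into a vector and
# encode the marginal constraints:
#   sum_j gamma[i, j] = p[i]  and  sum_i gamma[i, j] = q[j].
A_eq = []
for i in range(n):                    # row-sum constraints
    row = np.zeros((n, n)); row[i, :] = 1.0
    A_eq.append(row.ravel())
for j in range(n):                    # column-sum constraints
    col = np.zeros((n, n)); col[:, j] = 1.0
    A_eq.append(col.ravel())
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
wasserstein = res.fun  # cost of the cheapest plan (here: 2.0)
```

Each unit of mass must travel two positions to the right, so the cheapest plan costs 2.0 in this toy setup.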
Next, let’s look into Lipschitz constraint.
A real function f : R → R is Lipschitz continuous if
for all real values x₁ and x₂. Intuitively, a Lipschitz continuous function limits how fast f can change.
If we connect f(x₁) and f(x₂), the absolute value of the slope is at most K, which is called the Lipschitz constant of the function. If a function has bounded first derivatives, it is Lipschitz.
Its Lipschitz constant equals the maximum absolute value of the derivatives.
For the sine function, the absolute value of its derivative is always bounded by 1, and therefore it is Lipschitz. Its Lipschitz constant equals one. Intuitively, Lipschitz continuity bounds the gradients and is therefore beneficial in mitigating gradient explosions in deep learning.
Graphically, we can construct a double cone with slopes K and −K. If we move its origin along the graph, the graph always stays outside the cone.
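We can check this numerically. The sketch below estimates the Lipschitz constant of sin by sampling slopes over many closely spaced pairs of points; by the mean value theorem every such slope is bounded by max |cos(x)| = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10.0, 10.0, 100_000)
eps = 1e-4

# Sampled slope |f(x + eps) - f(x)| / eps approximates |cos(x)|,
# and is never larger than the Lipschitz constant K = 1.
slopes = np.abs(np.sin(x + eps) - np.sin(x)) / eps
K_est = slopes.max()  # approaches 1 from below
```

With enough samples, `K_est` gets arbitrarily close to the true Lipschitz constant 1 without ever exceeding it.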
The original formulation of the Wasserstein Distance is usually intractable. Fortunately, using the Kantorovich-Rubinstein duality, we can simplify the calculation from
where sup is the least upper bound. We need to find a 1-Lipschitz function f that satisfies the constraint
In the next few sections, we give a high-level proof while avoiding tedious mathematical detail. These sections are optional reading.
Dual Problem (optional)
Let's study the duality problem in the context of linear programming. In optimization, we can associate a dual problem with a primal problem.
According to the Weak Duality Theorem, z (= cᵀx) is a lower bound of v (z ≤ v). Equivalently, the dual objective establishes an upper bound for the primal objective.
Strong Duality claims that the optimal values of the primal and the dual problem are the same (z* = v*). However, Strong Duality holds only under special conditions. Optimization problems that can be framed as linear programs do satisfy Strong Duality (proof). So if the primal problem is hard to solve, we check whether the dual may be easier. This leads to the Kantorovich-Rubinstein duality.
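To make Strong Duality concrete, here is a minimal sketch with a hypothetical linear program (the numbers are illustrative, not from the article's figures). Solving the primal and its dual gives the same optimal value:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical primal LP: maximize c^T x  subject to  A x <= b, x >= 0.
c = np.array([3.0, 5.0])
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 2.0]])
b = np.array([4.0, 12.0, 18.0])

# linprog minimizes, so negate the objective for the primal.
primal = linprog(-c, A_ub=A, b_ub=b, bounds=(0, None))
z_star = -primal.fun

# Dual LP: minimize b^T y  subject to  A^T y >= c, y >= 0.
# Rewrite A^T y >= c as -A^T y <= -c for linprog.
dual = linprog(b, A_ub=-A.T, b_ub=-c, bounds=(0, None))
v_star = dual.fun

# Strong Duality: z_star == v_star (both equal 36 for this example).
```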
Kantorovich-Rubinstein duality (optional)
So let's apply this to finding the Wasserstein Distance. Our primal and dual problems take the form:
The primal is
This is hard to solve, so let's try to solve the dual instead,
The objective is to maximize fᵀPr + gᵀPg while maintaining the constraint f(xᵢ) + g(xⱼ) ≤ Dᵢⱼ.
Since Pr and Pg are non-negative, maximizing the objective means picking the largest g allowed by the constraint, i.e. g = −f rather than g < −f. Without further proof, g = −f is what we want in general. Our constraint becomes
That is, we want to limit the rate of change of our function. Therefore, to solve the dual problem, we need f to be Lipschitz continuous.
To find Wasserstein Distance, we maximize the dual problem with the constraint that f is 1-Lipschitz continuous.
The terms Pr f and Pg f become the expected values of f(x) sampled from the distributions Pr and Pg respectively.
Graphically, we want f to have a high value where Pr ≥ Pg (Pθ below) and a low value where Pr < Pg.
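As a sanity check of this dual view, consider a hypothetical 1D example where Pg is Pr shifted left by 3. For a pure shift, the 1-Lipschitz function f(x) = x attains the supremum, so the difference of expectations recovers the Wasserstein distance (which equals the shift):

```python
import numpy as np

rng = np.random.default_rng(1)
xr = rng.normal(3.0, 1.0, 200_000)  # samples from Pr = N(3, 1)
xg = rng.normal(0.0, 1.0, 200_000)  # samples from Pg = N(0, 1)

# f(x) = x is 1-Lipschitz and optimal for a pure shift, so
# E_Pr[f(x)] - E_Pg[f(x)] estimates the Wasserstein distance (= 3).
w_estimate = xr.mean() - xg.mean()
```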
In GAN, we train a discriminator and a generator with the cost functions below.
In WGAN, we train a Lipschitz function f.
We optimize the parameters w of the critic f with the Wasserstein Distance objective.
To enforce the Lipschitz constraint, WGAN simply clips the value of w.
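A minimal numpy sketch of this clipping step (the weight matrix and the gradient step are stand-ins; the clipping threshold c = 0.01 is the value used in the WGAN paper's experiments):

```python
import numpy as np

c = 0.01  # clipping threshold from the WGAN paper

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(4, 4))     # a hypothetical critic weight matrix
w -= 0.001 * rng.normal(size=w.shape)     # a stand-in gradient update
w = np.clip(w, -c, c)                     # clip every weight into [-c, c]
```

After every update, every weight lies in [−c, c], which (crudely) bounds the Lipschitz constant of the critic.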
Here is the WGAN algorithm.
However, as quoted from the research paper:
Weight clipping is a clearly terrible way to enforce a Lipschitz constraint. If the clipping parameter is large, then it can take a long time for any weights to reach their limit, thereby making it harder to train the critic till optimality. If the clipping is small, this can easily lead to vanishing gradients when the number of layers is big, … and we stuck with weight clipping due to its simplicity and already good performance.
Tuning the clipping value c is hard: a small change can alter the gradient norm significantly. In addition, some researchers believe that the improvement in image quality in WGAN comes at the cost of mode diversity. WGAN-GP introduces a gradient penalty to enforce the Lipschitz constraint. However, this increases the computational complexity.
The spectral norm (a matrix norm) is defined as:
Conceptually, it measures how far the matrix A can stretch a vector x.
According to SVD, a matrix can be decomposed as:
U and V are chosen to be the orthogonal (UᵀU = I) eigenvectors of AAᵀ and AᵀA respectively. Both AAᵀ and AᵀA have the same non-negative eigenvalues. S contains the square roots of these eigenvalues, which are called the singular values.
The spectral norm σ(A) is the maximum singular value of matrix A (proof).
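We can verify this numerically with a small sketch: the largest singular value from the SVD equals the maximum stretch, which is achieved along the top right singular vector and never exceeded in any other direction.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))

# The spectral norm sigma(A) is the largest singular value.
U, S, Vt = np.linalg.svd(A)
sigma = S[0]  # singular values come sorted in descending order

# v1 is the direction A stretches the most: ||A v1|| == sigma.
v1 = Vt[0]
stretch_top = np.linalg.norm(A @ v1)

# Any other direction is stretched by at most sigma.
x = rng.normal(size=(5, 1000))
stretches = np.linalg.norm(A @ x, axis=0) / np.linalg.norm(x, axis=0)
```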
Let's construct a deep network with the following formulation.
Each layer L outputs y = Wx followed by an activation function a, say ReLU.
The Lipschitz constant (or Lipschitz norm) of function g can be computed with its derivative as:
The Lipschitz constant for g = Wx is the spectral norm of W.
The Lipschitz constant for ReLU is 1, as its maximum slope equals one.
Therefore the Lipschitz constant for the whole deep network f is
Spectral normalization normalizes the weight for each layer with the spectral norm σ(W) such that the Lipschitz constant for each layer as well as the whole network equals one.
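The sketch below illustrates this with a hypothetical two-layer ReLU network: after dividing each weight matrix by its spectral norm, the product of the layer Lipschitz constants, and hence the network's Lipschitz constant, is at most one.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))

spectral_norm = lambda W: np.linalg.svd(W, compute_uv=False)[0]
relu = lambda z: np.maximum(z, 0.0)

# Spectral normalization: divide each layer's weight by its spectral norm.
W1_sn = W1 / spectral_norm(W1)
W2_sn = W2 / spectral_norm(W2)
f = lambda x: W2_sn @ relu(W1_sn @ x)

# Empirically, ||f(x1) - f(x2)|| / ||x1 - x2|| never exceeds
# sigma(W1_sn) * 1 * sigma(W2_sn) = 1.
x1 = rng.normal(size=(4, 1000))
x2 = rng.normal(size=(4, 1000))
ratios = np.linalg.norm(f(x1) - f(x2), axis=0) / np.linalg.norm(x1 - x2, axis=0)
```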
With Spectral Normalization, we can renormalize the weight whenever it is updated. This creates a network that mitigates gradient explosion problems. But how can we compute σ(W) efficiently, given that computing singular values directly is expensive?
Applying the kth power of A to an eigenvector u is easy to compute.
We can express a vector v as a linear combination of the eigenvectors of A and then compute Aᵏv easily.
Let uᵢ be the eigenvectors of A, reordered so that u₁ is the eigenvector with the largest eigenvalue (the dominant eigenvector). A vector b can be decomposed in this eigenvector basis, and Aᵏb will approach:
Starting from a random vector b₀ and applying the following iteration, b converges to a multiple of the dominant eigenvector. In short, by repeatedly multiplying a random vector by A (and renormalizing), we approximate the dominant eigenvector.
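Here is a minimal power-iteration sketch. The matrix is a constructed example with known eigenvalues (3 is the largest), so we can see the Rayleigh quotient of the iterate converge to it:

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a symmetric matrix with prescribed eigenvalues 3 > 2 > 1 > 0.5 > 0.1.
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
A = Q @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1]) @ Q.T

b = rng.normal(size=5)          # random starting vector b0
for _ in range(100):
    b = A @ b
    b /= np.linalg.norm(b)      # renormalize so the iterate stays bounded

# b now approximates the dominant eigenvector; the Rayleigh quotient
# b^T A b approximates the largest eigenvalue (3 here).
lam_est = b @ A @ b
```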
The same concept can be done with u and v computed iteratively as follows (where A is now written as W).
In SGD, the change of W is small in each iteration, so the change of the spectral norm is small as well. Spectral Normalization can therefore reuse the value of u from the previous iteration. Indeed, instead of running the power iteration to convergence, a single iteration suffices, i.e. we compute equations 20 and 21 below only once per update. That significantly reduces the computational complexity. Here is the final algorithm for applying Spectral Normalization.
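The update rule can be sketched in numpy as follows. This is an illustration of the idea, not the paper's exact implementation: u persists across SGD steps, one power-iteration step refreshes the estimate of σ(W), and the weight is divided by it (in the actual method the normalized weight is used in the forward pass while the raw W is updated):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
u = rng.normal(size=16)          # persistent estimate of the left singular vector
u /= np.linalg.norm(u)

def spectral_normalize(W, u, n_iters=1):
    """One power-iteration step reusing the persistent u, then divide W
    by the estimated spectral norm."""
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v            # estimated largest singular value
    return W / sigma, u

# Simulate SGD: W drifts slightly each step while u carries over,
# so a single power iteration per update is enough in practice.
for _ in range(50):
    W += 0.001 * rng.normal(size=W.shape)   # stand-in gradient update
    W_sn, u = spectral_normalize(W, u)

# The normalized weight has spectral norm close to 1.
sigma_exact = np.linalg.svd(W_sn, compute_uv=False)[0]
```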
The Spectral Normalization paper offers a mathematical analysis of the effectiveness of this normalization method over others in sections 2.3 and 3 and appendices D, E & F. It is not easy reading, but I encourage you to study it if you want more detail.