GAN — Spectral Normalization


GAN is vulnerable to mode collapse and training instability. In this article, we look into how Spectral Normalization resolves these problems. However, to understand Spectral Normalization, we need to review WGAN first. The discussion involves mathematical concepts that people may not be familiar with. We will introduce the terms first and make them approachable later.

In theory, when the generated distribution in GAN and the real distribution are disjoint, the gradient of the cost function consists of regions of vanishing and exploding gradients (the red line below).

[Figure: the gradient of the original GAN cost (red) versus the Wasserstein distance (blue) when the two distributions are disjoint.]
Source

WGAN claims that the Wasserstein distance is a better cost function since it has a smoother gradient (the blue line) for better learning. In practice, with proper tuning, vanilla GAN can generate high-quality images, but mode collapse remains a prominent issue. The real challenge for the Wasserstein distance is how to compute it efficiently. One possibility is to apply a Lipschitz constraint in computing the Wasserstein distance. WGAN applies simple weight clipping to enforce the 1-Lipschitz constraint. WGAN-GP enforces the constraint with a more complex scheme to correct some of the observed problems in WGAN (details on WGAN & WGAN-GP can be found here).

Spectral Normalization is a weight normalization that stabilizes the training of the discriminator. It controls the Lipschitz constant of the discriminator to mitigate the exploding gradient problem and the mode collapse problem. This approach is computationally light compared with WGAN-GP and achieves good mode coverage.

Earth-Mover (EM) distance / Wasserstein metric

Let’s have a quick review of the Wasserstein distance. First, we want to move 6 boxes from the left below to the locations on the right marked by the dotted square. For box #1, let’s move it from location 1 to location 7. The moving cost is defined as the travel distance. So to move box #1 to location 7, the cost equals 6.

[Figure: six boxes on the left are moved to the target locations marked by the dotted squares on the right.]

However, there is more than one way to move these boxes. Let's call moving plan i γᵢ. For example, in the first plan γ₁ below, we move 2 boxes from location 1 to location 10. Below, we show two different plans that both have a moving cost of 42.

[Figure: two different moving plans, each with a total cost of 42.]

However, not all transport plans bear the same cost. The Wasserstein distance (or the EM distance) is the cost of the cheapest transport plan. In the example below, the Wasserstein distance (minimum cost) is two.

[Figure: an example where the cheapest plan, and hence the Wasserstein distance, has cost 2.]

The definition of the Wasserstein distance extends easily from the discrete space to the continuous space as:

W(Pr, Pg) = inf_{γ ∈ Π(Pr, Pg)} E_{(x, y)∼γ}[ ‖x − y‖ ]

Let’s come back to our example again. Each table cell (x, y) below corresponds to the number of boxes moving from position x to y. After normalizing the cell value to a probability value, the moving cost for γ₁ is computed as:

cost(γ₁) = Σ_{x, y} γ₁(x, y) ‖x − y‖

The Wasserstein distance is the cost of the cheapest such plan.

For any valid plan γ, its marginals must equal p(x) and p(y) respectively.

Σ_y γ(x, y) = p(x)        Σ_x γ(x, y) = p(y)

For example,

[Figure: checking that the row and column sums of plan γ₁ equal p(x) and p(y).]

In the continuous space,

Π(p, q) = { γ : ∫ γ(x, y) dy = p(x),  ∫ γ(x, y) dx = p(y) }

is the set of valid plans whose marginals equal p(x) and p(y) respectively. Each γ in this set is a joint probability distribution over x and y. We want to find the plan in this set that has the lowest cost.
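
To make this concrete, here is a minimal sketch (not from the article) that finds the cheapest plan with scipy.optimize.linprog. The positions and box counts are made up for illustration; the equality constraints are exactly the row and column sums described above.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical source/target distributions over locations 1..10 (illustrative only).
positions = np.arange(1, 11)
p = np.array([2, 2, 1, 1, 0, 0, 0, 0, 0, 0], float)   # boxes at each source location
q = np.array([0, 0, 0, 0, 0, 0, 2, 2, 1, 1], float)   # boxes required at each target
p, q = p / p.sum(), q / q.sum()                        # normalize to probabilities

# Cost of moving one unit of mass from x to y is the travel distance |x - y|.
D = np.abs(positions[:, None] - positions[None, :]).astype(float)

n = len(positions)
c = D.flatten()                                        # objective: sum_ij gamma_ij * D_ij

# Marginal constraints: rows of gamma sum to p, columns sum to q.
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1                     # sum_j gamma_ij = p_i
    A_eq[n + i, i::n] = 1                              # sum_i gamma_ij = q_i
b_eq = np.concatenate([p, q])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
gamma = res.x.reshape(n, n)                            # the optimal transport plan
print("Wasserstein-1 distance:", res.fun)              # cost of the cheapest plan
```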

Next, let’s look into Lipschitz constraint.

Lipschitz continuity

A real function f : ℝ → ℝ is Lipschitz continuous if

| f(x₁) − f(x₂) | ≤ K | x₁ − x₂ |

for all real values x₁ and x₂. Intuitively, a Lipschitz continuous function limits how fast f changes.


If we connect f(x₁) and f(x₂) with a straight line, the absolute value of its slope is at most K. The smallest such K is called the Lipschitz constant of the function. If a function has bounded first derivatives, it is Lipschitz.

Its Lipschitz constant equals the maximum absolute value of its derivative.

For the sine function, the absolute value of its derivative is always bounded by 1, and therefore it is Lipschitz. Its Lipschitz constant equals one. Intuitively, Lipschitz continuity bounds the gradients and is beneficial in mitigating gradient explosion in deep learning.
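
As a quick numerical sanity check (a minimal numpy sketch, not part of the original article), we can estimate the Lipschitz constant of sin by sampling pairs of points and taking the largest slope between them; the estimate approaches 1 and never exceeds it.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-10, 10, size=100_000)
x2 = rng.uniform(-10, 10, size=100_000)
mask = np.abs(x1 - x2) > 1e-9                # avoid dividing by ~0

# |f(x1) - f(x2)| / |x1 - x2| is bounded by the Lipschitz constant K.
slopes = np.abs(np.sin(x1[mask]) - np.sin(x2[mask])) / np.abs(x1[mask] - x2[mask])
print(slopes.max())                          # approaches, but never exceeds, 1.0
```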


Graphically, we can construct a double cone with slopes K and −K. If we move its vertex along the graph, the graph will always stay outside the cone.

[Figure: for a Lipschitz function, the graph stays outside the double cone as the cone's vertex moves along it.]
Source

Wasserstein Distance

The original formulation of the Wasserstein distance is usually intractable. Fortunately, using the Kantorovich-Rubinstein duality, we can simplify the calculation from

W(Pr, Pg) = inf_{γ ∈ Π(Pr, Pg)} E_{(x, y)∼γ}[ ‖x − y‖ ]

to

W(Pr, Pg) = sup_{‖f‖_L ≤ 1}  E_{x∼Pr}[ f(x) ] − E_{x∼Pg}[ f(x) ]

where sup is the least upper bound. We need to find a 1-Lipschitz function f that satisfies the constraint

| f(x₁) − f(x₂) | ≤ ‖ x₁ − x₂ ‖

and maximizes

E_{x∼Pr}[ f(x) ] − E_{x∼Pg}[ f(x) ]

In the next few sections, we give a high-level proof without the tedious mathematical details. These sections are optional reading.

Dual Problem (optional)

Let’s study the duality problem in the context of linear algebra. In optimization, we can associate a dual problem with a primal problem.

Primal:  maximize  z = cᵀx   subject to  Ax ≤ b,  x ≥ 0
Dual:    minimize  v = bᵀy   subject to  Aᵀy ≥ c,  y ≥ 0
Source

According to the Weak Duality Theorem, z (= cᵀx) is a lower bound of v (= bᵀy), i.e. z ≤ v. Therefore, the dual problem establishes an upper bound for the primal problem.

cᵀx ≤ (Aᵀy)ᵀx = yᵀ(Ax) ≤ yᵀb = bᵀy        (using Aᵀy ≥ c, x ≥ 0, Ax ≤ b, y ≥ 0)
Proof for Weak Duality

Strong Duality claims that the optimal solutions of the primal and the dual problem coincide (z* = v*). However, Strong Duality holds only under special conditions. Optimization problems that can be framed as linear programs do satisfy Strong Duality whenever a finite optimum exists (proof). So if the primal problem is hard to solve, we check whether the dual may be easier. That leads to the Kantorovich-Rubinstein duality.
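
To see weak and strong duality numerically, here is a small sketch with scipy.optimize.linprog on a toy linear program (the numbers are illustrative only): the primal maximum and the dual minimum coincide.

```python
import numpy as np
from scipy.optimize import linprog

# Primal:  maximize  z = c^T x   subject to  A x <= b,  x >= 0
# Dual:    minimize  v = b^T y   subject to  A^T y >= c,  y >= 0
c = np.array([3.0, 5.0])
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 2.0]])
b = np.array([4.0, 12.0, 18.0])

# linprog minimizes, so negate c for the primal maximization.
primal = linprog(-c, A_ub=A, b_ub=b, bounds=(0, None), method="highs")
# Rewrite A^T y >= c as -A^T y <= -c for linprog.
dual = linprog(b, A_ub=-A.T, b_ub=-c, bounds=(0, None), method="highs")

print("primal optimum z* =", -primal.fun)   # 36.0
print("dual   optimum v* =", dual.fun)      # 36.0  (strong duality: z* = v*)
```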

Kantorovich-Rubinstein duality (optional)

So let’s apply this in finding Wasserstein Distance. Our primal and dual problem will be in the form of:

Primal:  minimize  cᵀx   subject to  Ax = b,  x ≥ 0
Dual:    maximize  bᵀy   subject to  Aᵀy ≤ c

The primal is

W(Pr, Pg) = min_x  cᵀx   subject to  Ax = b,  x ≥ 0

with

x = vec(Γ), the transport plan flattened into a vector,
c = vec(D), the pairwise costs Dᵢⱼ = ‖xᵢ − xⱼ‖ flattened the same way,
b = (Pr, Pg), the two marginals stacked together,
A = the 0/1 matrix whose rows sum Γ over each row and each column (so Ax = b encodes the marginal constraints).
Source

This is hard to solve directly, so let's try to solve the dual instead:

maximize  bᵀy = fᵀPr + gᵀPg   subject to  Aᵀy ≤ c

with

y = (f, g), one variable f(xᵢ) for each point of Pr and one variable g(xⱼ) for each point of Pg,
so the constraint Aᵀy ≤ c reads  f(xᵢ) + g(xⱼ) ≤ Dᵢⱼ.
Source

The objective is maximizing fᵀPr + gᵀPg while maintaining the constraint f(xᵢ) + g(xⱼ) ≤ Dᵢⱼ.


Since Pr and Pg are non-negative, it makes sense to pick the highest possible value for g when maximizing the objective. For i = j the constraint gives f(xᵢ) + g(xᵢ) ≤ Dᵢᵢ = 0, so the best we can do is g = −f rather than g < −f. Without further proof, g = −f is what we want in general. Our constraint becomes

f(xᵢ) − f(xⱼ) ≤ Dᵢⱼ = ‖xᵢ − xⱼ‖

That is, we want to limit the rate of change of our function: since Dᵢⱼ = ‖xᵢ − xⱼ‖, this is exactly the 1-Lipschitz condition. Therefore, to solve the dual problem, we need f to satisfy a Lipschitz constraint.

WGAN

To find the Wasserstein distance, we maximize the dual objective under the constraint that f is 1-Lipschitz continuous.

W(Pr, Pg) = max_{‖f‖_L ≤ 1}  fᵀPr − fᵀPg

The terms fᵀPr and fᵀPg are the expected values of f(x) with x sampled from Pr and Pg respectively.

W(Pr, Pg) = sup_{‖f‖_L ≤ 1}  E_{x∼Pr}[ f(x) ] − E_{x∼Pg}[ f(x) ]

Graphically, we want f to take a high value where Pr ≥ Pg (Pθ below) and a low value where Pr < Pg.

[Figure: the value of the critic f relative to the distributions Pr and Pθ.]
Source

In GAN, we train a discriminator and a generator with the cost functions below.

min_G max_D  E_{x∼p_data}[ log D(x) ] + E_{z∼p_z}[ log(1 − D(G(z))) ]

In WGAN, we train a Lipschitz function f.

min_G max_{‖f‖_L ≤ 1}  E_{x∼Pr}[ f(x) ] − E_{z∼p(z)}[ f(G(z)) ]

We optimize the parameters w of the critic f to estimate the Wasserstein distance using the objective:

max_{w ∈ W}  E_{x∼Pr}[ f_w(x) ] − E_{z∼p(z)}[ f_w(g_θ(z)) ]

To enforce the Lipschitz constraint, WGAN simply clips the value of w.

w ← clip(w, −c, c)

WGAN Algorithm

Here is the WGAN algorithm.

[Figure: the WGAN training algorithm.]
Source
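
Putting the pieces together, a critic update with weight clipping might look like the sketch below. This is a minimal PyTorch sketch, not the authors' reference code; the critic and generator are tiny toy networks on 2-D data just to keep it self-contained, while c = 0.01 and RMSprop follow the defaults listed in the paper.

```python
import torch
import torch.nn as nn

z_dim = 8
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
generator = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, 2))

c = 0.01                                   # clipping threshold from the WGAN paper
opt_critic = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

for step in range(100):
    real = torch.randn(64, 2) + 3.0        # pretend "real" samples
    z = torch.randn(64, z_dim)
    fake = generator(z).detach()           # do not backprop into the generator here

    # Maximize E[f(real)] - E[f(fake)]  ==  minimize the negation.
    loss_critic = -(critic(real).mean() - critic(fake).mean())

    opt_critic.zero_grad()
    loss_critic.backward()
    opt_critic.step()

    # Enforce the Lipschitz constraint by clipping every weight to [-c, c].
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
```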

However, as quoted from the WGAN paper:

Weight clipping is a clearly terrible way to enforce a Lipschitz constraint. If the clipping parameter is large, then it can take a long time for any weights to reach their limit, thereby making it harder to train the critic till optimality. If the clipping is small, this can easily lead to vanishing gradients when the number of layers is big, … and we stuck with weight clipping due to its simplicity and already good performance.

Tuning the clipping value c is hard: a small change can alter the gradient norm significantly. In addition, some researchers believe that WGAN's improvement in image quality comes at the cost of mode diversity. WGAN-GP introduces a gradient penalty to enforce the Lipschitz constraint. However, this increases the computational complexity.
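
For reference, the gradient penalty is computed roughly as in the sketch below (a hedged PyTorch sketch, not WGAN-GP's reference implementation; the critic, the real/fake batches, and the commonly used weight lambda_gp = 10 are stand-ins). The extra gradient computation through the critic is what makes it more expensive.

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP style penalty: push the critic's gradient norm toward 1
    at random points interpolated between real and fake samples."""
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(outputs=scores, inputs=interp,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]   # keep the graph: the penalty is trained
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

# Toy usage with a tiny critic on 2-D data.
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
gp = gradient_penalty(critic, torch.randn(16, 2), torch.randn(16, 2))
```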

Spectral norm

The spectral norm (the matrix 2-norm) of a matrix A is defined as:

σ(A) = max_{x ≠ 0}  ‖Ax‖₂ / ‖x‖₂

Conceptually, it measures how far the matrix A can stretch a vector x.

According to SVD, a matrix can be decomposed as:

A = U S Vᵀ,    S = diag(σ₁, σ₂, …)   with σ₁ ≥ σ₂ ≥ … ≥ 0

where

UᵀU = I,    VᵀV = I

The columns of U and V are chosen to be orthonormal (UᵀU = I) eigenvectors of AAᵀ and AᵀA respectively. Both AAᵀ and AᵀA have the same non-zero eigenvalues. S contains the square roots of these eigenvalues, which are called the singular values.

The spectral norm σ(A) is the maximum singular value of matrix A (proof).
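
A quick numpy check (an illustrative sketch): the largest singular value returned by SVD matches both np.linalg.norm(A, 2) and the definition above estimated over random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

sigma = np.linalg.svd(A, compute_uv=False)[0]   # largest singular value
print(sigma, np.linalg.norm(A, 2))              # identical: both are the spectral norm

# Empirical check of the definition: max ||Ax|| / ||x|| over random x.
x = rng.standard_normal((3, 10_000))
ratios = np.linalg.norm(A @ x, axis=0) / np.linalg.norm(x, axis=0)
print(ratios.max())                             # approaches sigma from below
```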

Spectral normalization

Let’s construct a deep network with the following formularization.

f(x) = a_L( W_L a_{L−1}( W_{L−1} ⋯ a_1( W_1 x ) ⋯ ) )
Source

Each layer l computes y = W_l x followed by an activation function a_l, say ReLU.

The Lipschitz constant (or Lipschitz norm) of function g can be computed with its derivative as:

‖g‖_Lip = sup_x σ( ∇g(x) )

The Lipschitz constant of the linear map g(x) = Wx is the spectral norm of W:

‖g‖_Lip = sup_x σ( ∇g(x) ) = σ(W)        since ∇g(x) = W

The Lipschitz constant of ReLU is 1, as its maximum slope equals one.

Therefore the Lipschitz constant for the whole deep network f is

‖f‖_Lip ≤ ∏_l ‖a_l‖_Lip · σ(W_l) = ∏_l σ(W_l)
Source
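
The sketch below (toy dimensions, chosen arbitrarily) checks this bound numerically for a small two-layer ReLU network: slopes measured between random input pairs never exceed the product of the layers' spectral norms.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((16, 8))
W2 = rng.standard_normal((4, 16))

def f(x):                                    # a tiny 2-layer ReLU network
    return W2 @ np.maximum(W1 @ x, 0.0)

bound = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)   # product of spectral norms

x1 = rng.standard_normal((8, 5000))
x2 = rng.standard_normal((8, 5000))
slopes = np.linalg.norm(f(x1) - f(x2), axis=0) / np.linalg.norm(x1 - x2, axis=0)
print(slopes.max(), "<=", bound)             # empirical Lipschitz estimate vs. the bound
```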

Spectral normalization normalizes the weight of each layer by its spectral norm σ(W) so that the Lipschitz constant of every layer, and hence of the whole network, is bounded by one.

W_SN = W / σ(W)

With Spectral Normalization, we renormalize the weight whenever it is updated. This creates a network that mitigates gradient explosion problems. But how can we compute σ(W) efficiently, since computing singular values or eigenvectors exactly is expensive?
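
As a usage note, PyTorch ships torch.nn.utils.spectral_norm, which wraps a layer so that its weight is divided by an estimate of σ(W) (refined by one power-iteration step per forward pass, as discussed next). The discriminator below is a made-up architecture just to show where the wrapper goes.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Each wrapped layer's weight is rescaled to W / sigma(W), so the layer is ~1-Lipschitz.
discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1)),
    nn.LeakyReLU(0.1),
    spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1)),
    nn.LeakyReLU(0.1),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 8 * 8, 1)),   # assumes 32x32 inputs
)
```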

Power Iteration

The kth power of A applied to an eigenvector u is easy to compute:

Aᵏ u = λᵏ u

We can express a vector v as a linear combination of the eigenvectors of A and then apply Aᵏ to it easily:

v = c₁u₁ + c₂u₂ + … + cₙuₙ
Aᵏ v = c₁λ₁ᵏ u₁ + c₂λ₂ᵏ u₂ + … + cₙλₙᵏ uₙ

Let uᵢ be the eigenvectors of A, reordered so that u₁ is the eigenvector with the largest eigenvalue (the dominant eigenvector). A vector b can be decomposed over these eigenvectors, and Aᵏb will approach a multiple of u₁:

Aᵏ b = c₁λ₁ᵏ u₁ + c₂λ₂ᵏ u₂ + … ≈ c₁λ₁ᵏ u₁        (λ₁ dominates for large k)
Source

Starting with a random vector b₀ and applying the iteration below, bₖ converges to a multiple of the dominant eigenvector. In short, we start with a random vector and repeatedly multiply it by A (renormalizing each time) to approximate the dominant eigenvector.

b_{k+1} = A bₖ / ‖A bₖ‖
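
Here is a minimal numpy sketch of the power iteration (my own illustration, not from the paper). Applied to AᵀA, it recovers the dominant eigenvalue, whose square root is σ(A).

```python
import numpy as np

def power_iteration(M, steps=100):
    """Approximate the dominant eigenvector/eigenvalue of a square matrix M."""
    b = np.random.default_rng(0).standard_normal(M.shape[0])
    for _ in range(steps):
        b = M @ b
        b = b / np.linalg.norm(b)           # renormalize each step
    return b, b @ M @ b                     # eigenvector, Rayleigh quotient

A = np.random.default_rng(1).standard_normal((5, 3))
_, lam = power_iteration(A.T @ A)           # dominant eigenvalue of A^T A
print(np.sqrt(lam), np.linalg.svd(A, compute_uv=False)[0])   # both approximate sigma(A)
```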

The same concept applies to the singular vectors, with u and v computed iteratively as follows (where A is now written as W).

v ← Wᵀu / ‖Wᵀu‖₂
u ← W v / ‖W v‖₂
σ(W) ≈ uᵀ W v
Modified from source

In SGD, the change in W at each iteration is small, so the change in its spectral norm is small as well. Spectral Normalization can therefore reuse the value of u from the previous iteration. Indeed, instead of running the power iteration to convergence, a single iteration is needed, i.e. we perform the u and v updates above only once per weight update. That significantly reduces the computational complexity. Here is the final algorithm for applying Spectral Normalization.

[Figure: SGD with spectral normalization, the full algorithm from the paper.]
Source
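
A sketch of this per-update procedure for a single weight matrix, written in plain numpy under the assumption that u persists across SGD steps (variable names are mine, not the paper's reference code):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))            # a layer's weight matrix
u = rng.standard_normal(64)                  # persistent estimate of the left singular vector

def spectrally_normalize(W, u):
    """One power-iteration step, then divide W by the estimated spectral norm."""
    v = W.T @ u
    v = v / np.linalg.norm(v)
    u = W @ v
    u = u / np.linalg.norm(u)
    sigma = u @ W @ v                        # estimate of the largest singular value
    return W / sigma, u                      # u is reused at the next step

for step in range(200):                      # stand-in for SGD updates (W held fixed here)
    W_sn, u = spectrally_normalize(W, u)
    # ... use W_sn in the forward/backward pass, then update W ...

print(np.linalg.svd(W_sn, compute_uv=False)[0])   # close to 1 after normalization
```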

Thoughts

The Spectral Normalization paper offers a mathematical analysis of the effectiveness of this normalization method over others in sections 2.3 and 3 and appendices D, E & F. It is not easy reading, but I encourage you to go through it if you want to study the method in more detail.

Credits and references

Spectral Normalization for Generative Adversarial Networks

Wasserstein GAN and the Kantorovich-Rubinstein Duality

Power method for approximating eigenvalues
