# GAN — Spectral Normalization

GAN is vulnerable to mode collapse and training instability. In this article, we look into Spectral Normalization in resolving these problems. However, to understand Spectral Normalization, we need to review WGAN first. The discussion will involve mathematical concepts that people may not familiar with. We will introduce the terms first and make them approachable later.

In theory, when the generated distribution in GAN and the real distribution are disjoint, the gradient of the cost function will compose of areas of vanished and exploding gradients — the red line below.

WGAN claims that the Wasserstein distance will be a better cost function since it has a smoother gradient (the blue line) for better learning. In practice, with proper tuning, vanilla GAN can generate high-quality images but mode collapse remains a prominent issue. And the real challenge for the Wasserstein distance is how to compute it efficiently. One possibility is applying Lipschitz constraint to compute the Wasserstein distance. WGAN applies a simple clipping to enforce the 1-Lipschitz constraint. WGAN-GP enforces the constraint with a more complex scheme to correct some of the observed problems in WGAN (details on WGAN & WGAN-GP can be found here).

Spectral Normalization is a weight normalization that stabilizes the training of the discriminator. It controls the Lipschitz constant of the discriminator to mitigate the exploding gradient problem and the mode collapse problem. This approach will be computationally light compared with WGAN-GP and achieve good mode coverage.

# Earth-Mover (EM) distance/ Wasserstein Metric

Let’s have a quick review of the Wasserstein distance. First, we want to move 6 boxes from the left below to the locations on the right marked by the dotted square. For box #1, let’s move it from location 1 to location 7. The moving cost is defined as the travel distance. So to move box #1 to location 7, the cost equals 6.

However, there is more than one way to move these boxes. Let’s call the moving plan *i* to be γ*ᵢ*. For example, in the first plan γ₁ below, we move 2 boxes from location 1 to location 10. Below, we show two different plans that both have a moving cost of 42.

However, not all transport plans bear the same cost. The **Wasserstein distance **(or the **EM distance**) is the cost of the cheapest transport plan. In the example below, the Wasserstein distance (minimum cost) is two.

The formalization of the Wasserstein distance can be extended from the discrete space to the continuous space easily as:

Let’s come back to our example again. Each table cell (*x, y*) below corresponds to the number of boxes moving from position *x* to *y*. After normalizing the cell value to a probability value, the moving cost for *γ*₁ is computed as:

Wasserstein distance finds the distance for the cheapest plan.

For any valid plan *γ*, its margins must equal *p*(*x*) and *p*(*y*) respectively.

For example,

In the continuous space,

∏ are the sets of valid plans that its margins equal *p*(*x*) and *p*(*y*) respectively. *γ *is one of these plans that models a joint probability distribution* *for *x* and *y*. We want to find one in ∏ that has the lowest cost.

Next, let’s look into Lipschitz constraint.

# Lipschitz continuity

A real function *f* : *R* → *R* is Lipschitz continuous if

for ∀ real values of *x*₁ and *x*₂. Intuitively, a Lipschitz continuous function limits how fast *f* changes.

If we connect *f*(*x*₁) and *f*(*x*₂) together, its slope has an absolute value always smaller than *K *which is called the **Lipschitz constant** of the function. If a function has bounded first derivatives, it is Lipschitz.

Its Lipschitz constant equals the maximum absolute value of the derivatives.

For a sin function, the absolute value of its derivative is always bounded by 1 and therefore it is Lipschitz. The Lipschitz constant equals one. Intuitively, Lipschitz continuity bounds the gradients and beneficial in mitigating gradient explosions in deep learning.

Graphically, we can construct a double cone with slope *K* and -*K*. If we move its origin along the graph, the graph will always stay outside the cone.

**Wasserstein Distance**

The original formularization of the Wasserstein Distance is usually intractable. Fortunately, using the Kantorovich-Rubinstein duality, we can simplify the calculation from

to

where* **sup* is the least upper bound. We need to find ** f** which is a 1-Lipschitz function that follows the constraint

and maximizes

In the next few sections, we try to give a high-level proof without tedious mathematical terms. But these sections are optional reading.

# Dual Problem (optional)

Let’s study the duality problem in the context of linear algebra. In optimization, we can associate a dual problem with a primal problem.

According to the Weak Duality Theorem, *z* (*cᵀx*) is the lower bound of *v *(*z*≤*v*). Therefore, the dual problem establishes an upper bound for the primal problem.

Strong Duality claims that the optimal solution for the primal and the dual problem are the same (*z** = *v**). However, this Strong Duality holds for special conditions only. Optimization problems that can be framed in the form of linear algebra do have Strong Duality (proof). So if the primal problem is hard to solve, we check whether the dual may be easier. That leads to the Kantorovich-Rubinstein duality.

# Kantorovich-Rubinstein duality (optional)

So let’s apply this in finding Wasserstein Distance. Our primal and dual problem will be in the form of:

The primal is

with

This is hard to solve and let’s try to solve the dual instead,

with

The objective is maximizing *f* ᵀ P*r + g*ᵀ P*g *while maintaining the constraint *f*(*xᵢ*) +*g*(*xⱼ*) ≤ D*ᵢⱼ.*

Since P*r* and P*g *are non-negative, it makes sense to pick the highest possible value for *g *in maximizing the objective*, *i.e.* g* =*-f *instead of *g* < -*f*. Without further proof, *g* =*-f *is what we want in general. Our constraints become

That is, we want to limit the rate change of our function. Therefore, to solve the dual problem, we need *f* to be Lipschitz constraint.

# WGAN

To find Wasserstein Distance, we maximize the dual problem with the constraint that *f* is 1-Lipschitz continuous.

The term P*r*f and P*g*f become the expected value for *f*(*x*) sampled from distribution P*r* and P*g* respectively.

Graphically, we want *f* to have a high value if P*r* ≥ Pg* *(P*θ* below) and a low value if P*g* < P*r*.

In GAN, we train a discriminator and a generator with the cost functions below.

In WGAN, we train a Lipschitz function *f.*

We optimize *w* of the critic *f* with the Wasserstein Distance using the objective.

To enforce the Lipschitz constraint, WGAN simply clips the value of *w*.

# WGAN Algorithm

Here is the WGAN algorithm.

However, as quoted from the research paper:

Weight clipping is a clearly terrible way to enforce a Lipschitz constraint. If the clipping parameter is large, then it can take a long time for any weights to reach their limit, thereby making it harder to train the critic till optimality. If the clipping is small, this can easily lead to vanishing gradients when the number of layers is big, … and we stuck with weight clipping due to its simplicity and already good performance.

Tuning the value of *c* is hard. Finding the right value is difficult. A small change can alter the gradient norm significantly. In addition, some researchers believe that the improvement of image quality in WGAN costs the mode diversity. WGAN-GP introduces the gradient penalty to enforce the Lipschitz constraint. However, this increases the computation complexity.

# Spectral norm

The matrix norm or spectral norm is defined as:

Conceptually, it measures how far the matrix *A* can stretch a vector *x*.

According to SVD, a matrix can be decomposed as:

where

*U* and *V* are chosen to be the orthogonal (*UᵀU*=I) eigenvectors of AAᵀ and AᵀA respectively. Both AAᵀ and AᵀA have the same positive eigenvalues. S contains the square root of these eigenvalues which is called singular values.

The spectral norm σ(A) is the maximum singular value of matrix A (proof).

**Spectral normalization**

Let’s construct a deep network with the following formularization.

For each layer *L*, it outputs *y = Ax *followed by an activation function *a, *say ReLU.

The Lipschitz constant (or Lipschitz norm) of function *g* can be computed with its derivative as:

The Lipschitz constant for *g* = *Ax *is the spectral norm of *W*.

The Lipschitz constant for RELU is 1 as its maximum slope equals one.

Therefore the Lipschitz constant for the whole deep network *f* is

Spectral normalization normalizes the weight for each layer with the spectral norm σ(W) such that the Lipschitz constant for each layer as well as the whole network equals one.

With Spectral Normalization, we can renormalize the weight whenever it is updated. This creates a network that mitigates gradient explosion problems. But how can we compute σ(W) efficiently since finding eigenvectors are expensive.

# Power Iteration

The *k*th power of A with an eigenvector *u* is pretty easy to compute.

We can express a vector *v* into a linear combination of the eigenvectors of *A* and compute its *k*th power with A easily.

Consider *u* as eigenvectors of *A*. We can reorder these eigenvectors with u₁ being the eigenvector with the highest eigenvalue (dominant eigenvector). A vector *b* can be decomposed into eigenvectors of A. The *k*th power of *A* with *v* will approach:

Starting with *b₀ *using the following iteration, *b* will converge to a multiple of the dominant eigenvector. In short, we can start with a random vector *b₀*. Multiplying it with *A*, we can approximate the dominant eigenvector.

The same concept can be done with *u* and *v* computed iteratively as follows (where A is now written as W).

In SGD, the change of *W* is small in each iteration. Therefore, the change of the Spectral norm is small also. Therefore, Spectral normalization can reuse the value of *u* in the previous iteration. Indeed, instead of doing multiple iterations in the power iteration, only one iteration is needed, i.e. we only computer the equation 20 and 21 below once per update only. That significantly reduces the computational complexity. Here is the final algorithm in applying Spectral Normalization.

# Thoughts

The Spectral normalization paper offers a mathematical analysis of the effectiveness of this normalization method over others in section 2.3, 3 and appendix D, E & F. It is not easy to understand but I will encourage people to read it if you want to study it in more details.

# Credits and references

Spectral Normalization for Generative Adversarial Networks