GAN — A comprehensive review into the gangsters of GANs (Part 1)

[Photo by Ben White]

Are we there yet? In this GAN series, we identify a general pattern in how GANs are applied to deep learning problems and look into why GANs are so hard to train. We also check out some potential solutions. By reviewing them in one context, we can understand the motivation and direction of GAN research and the thought process behind it. Hopefully, by the end, we can navigate an immense amount of information with some clarity.

Applications

No more horror-movie looks; this is history. In 2017, GANs created 1024 × 1024 images (the right image above) that may fool a talent scout. Let's start with the basic idea of generating data x.

Without any guidance, the generator creates only random noise. In Goodfellow's 2014 GAN paper, a discriminator is added to guide the generator to imitate real images. If you want a refresher course on what a GAN is, this article from our series will help.
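That guidance takes the form of two competing objectives. As a minimal sketch (toy logits with NumPy, not the paper's code; all names here are illustrative), the discriminator loss and the commonly used non-saturating generator loss look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_loss(real_logits, fake_logits):
    # Discriminator maximizes log D(x) + log(1 - D(G(z)));
    # we minimize the negative of that average.
    return -np.mean(np.log(sigmoid(real_logits)) +
                    np.log(1.0 - sigmoid(fake_logits)))

def g_loss(fake_logits):
    # Non-saturating generator loss: maximize log D(G(z)),
    # i.e. minimize -log D(G(z)).
    return -np.mean(np.log(sigmoid(fake_logits)))

real = np.array([2.0, 3.0])    # discriminator confident on real samples
fake = np.array([-2.0, -3.0])  # ...and confident the fakes are fake
print(d_loss(real, fake))  # small: the discriminator is winning
print(g_loss(fake))        # large: the generator still has much to learn
```

In practice, training alternates: update the discriminator on a mini-batch, then update the generator on the fake-sample loss.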

While early GAN research focused on image generation, it has since expanded to other areas, like anime characters and music videos, and to other industries, including medicine. Another popular expansion is the cross-domain GAN, which transforms data from domain A (say, text) to domain B (image).

[Modified from source]

In the example above, we add an encoder to extract features from the text, followed by a generator to create images. Real and generated images are fed into a discriminator (an encoder followed by a sigmoid function) to identify whether they are real.

One nice demonstration of this cross-domain transfer is CycleGAN, which converts real scenery into a Monet-style painting (or vice versa).

[CycleGAN]

Many domain-transfer GANs add a second pathway to reconstruct the original image. The additional reconstruction cost guides the encoder to extract better features, so the generated images preserve the important features of the original input.
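A minimal sketch of such a reconstruction penalty, with toy placeholder mappings in NumPy (`G`, `F`, and `lam` are illustrative stand-ins, not CycleGAN's actual networks or weighting):

```python
import numpy as np

def cycle_consistency_loss(x, G, F, lam=10.0):
    """L1 reconstruction penalty: x -> G(x) -> F(G(x)) should return to x.
    G maps domain A to B, F maps B back to A (both placeholders here)."""
    return lam * np.mean(np.abs(F(G(x)) - x))

# Toy mappings: G shifts values up, F shifts them back down.
G = lambda x: x + 1.0
F = lambda y: y - 1.0
x = np.array([0.0, 0.5, 1.0])
print(cycle_consistency_loss(x, G, F))  # perfect reconstruction -> 0.0
```

If `F` fails to undo `G`, the loss grows, pushing both mappings to preserve the input's content.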

Cross-domain GANs have many potential commercial applications. We encourage you to look into the CycleGAN article in our series later if you are interested in domain-transfer applications.

Another type of GAN uses meta-data in generating images. For example, we create images in different poses or viewed from different angles.

[Transform the pose of the original image. (Modified from source)]

In this example, we add the meta-data as an additional input to the encoder to generate images. We also pass the original image (or sometimes the meta-data) as an additional input to the discriminator to distinguish images.
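In code, this conditioning often amounts to concatenating the meta-data onto a network's input. A toy sketch (shapes and names are assumptions for illustration, not from a specific paper):

```python
import numpy as np

def conditioned_input(image_feat, meta):
    """Concatenate meta-data (e.g. a pose vector) onto a feature vector
    before feeding the generator or discriminator."""
    return np.concatenate([image_feat, meta], axis=-1)

feat = np.zeros(128)              # pretend encoder output
pose = np.array([0.0, 1.0, 0.0])  # pretend one-hot pose label
x = conditioned_input(feat, pose)
print(x.shape)  # (131,)
```

The same trick works batched (concatenating along the last axis), and the discriminator receives the condition the same way so it can reject images that ignore it.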

Our last type of GAN does not really need to be a GAN. Everyone can take advice from a critic once in a while. Indeed, the concept of a critic (a discriminator) was formulated in reinforcement learning a while back. Researchers can apply GAN to existing solutions, like object detection, to refine their results. In one example, a deep network encoder and decoder reconstruct higher-resolution images, and the mean square error (MSE) is used to train the network.

To introduce the GAN concept, we use a discriminator to guide the training of the generator.
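The combined objective is then the pixel loss plus a small adversarial term. A rough sketch, with a weighting similar in spirit to SRGAN-style training (`lam` and the exact mix are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sr_total_loss(sr, hr, fake_logits, lam=1e-3):
    """Pixel-wise MSE between the super-resolved image `sr` and the
    high-resolution target `hr`, plus a small adversarial term from
    the discriminator's logits on the generated image."""
    mse = np.mean((sr - hr) ** 2)
    adv = -np.mean(np.log(sigmoid(fake_logits)))  # non-saturating GAN loss
    return mse + lam * adv

# Perfect pixels: only the (small) adversarial term remains.
print(sr_total_loss(np.ones(4), np.ones(4), np.array([0.0])))
```

The small `lam` keeps the MSE term dominant, so the adversarial critic only nudges the output toward sharper, more natural textures.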

To learn more about this type of application, you can read the super-resolution article from our series. It gives more details on applying the GAN concept to produce human-pleasing super-resolution images.

The following is the general idea of applying GAN to an existing solution.

In this section, we group different GAN applications together to illustrate their similarities, so it does not feel like there are hundreds of types of GAN applications. If you want to see them in real action later, visit our GAN applications article to see what they do and what their networks may look like.

Problems

Training GANs is not easy. GAN models may suffer from the following problems:

  • Mode collapse: the generator produces only a limited variety of samples,
  • Diminished gradient: the discriminator becomes so successful that the generator's gradient vanishes and it learns nothing,
  • Non-convergence: the model parameters oscillate, destabilize, and never converge,
  • Imbalance between the generator and discriminator causes overfitting, and
  • High sensitivity to hyperparameters.

Mode collapse

Mode collapse occurs when the generated images converge to the same image (the same optimal point).

Full mode collapse is not common, but partial collapse happens often. About half of the images below have a similar-looking counterpart.

[Images underlined with the same color look similar. (Modified from source)]

We have a limited understanding of the full dynamics of mode collapse in practice. Mitigation methods have been proposed that penalize the generator when it happens. However, partial collapse remains common.
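One quick, informal way to spot collapse is to check sample diversity directly. This is a homemade diagnostic for illustration, not a published metric:

```python
import numpy as np

def diversity_score(samples):
    """Mean pairwise L2 distance between generated samples.
    A score near zero suggests (partial) mode collapse."""
    n = len(samples)
    dists = [np.linalg.norm(samples[i] - samples[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

collapsed = np.ones((5, 8))                    # every "image" identical
healthy = np.random.RandomState(0).randn(5, 8)  # varied samples
print(diversity_score(collapsed))  # 0.0
print(diversity_score(healthy))    # clearly above zero
```

In real pipelines, the same idea is applied in a feature space (e.g. classifier embeddings) rather than raw pixels, since pixel distance is a crude proxy for perceptual similarity.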

Generator gradient

Gradient descent relies on gradients to learn. This behaves like dropping a marble into a bowl. However, if we place a marble carefully on the edge of the bowl below, it may never drop to the bottom.

[Modified from a photo from Anthony N in a Yelp review]

Optimizing a GAN generator is close to optimizing the JS-divergence. The figure below visualizes the value function of the JS-divergence. There is a low-gradient region, just like the edge of our bowl. At least in the beginning, the training is very slow.
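We can see this saturation numerically. The sketch below computes the JS-divergence between discrete distributions; when their supports are disjoint, it sticks at log 2 no matter how the generator's distribution shifts, so there is almost no gradient signal:

```python
import numpy as np

def kl(p, q):
    # KL divergence with a small epsilon for numerical safety.
    eps = 1e-12
    return np.sum(p * np.log((p + eps) / (q + eps)))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p  = np.array([1.0, 0.0, 0.0, 0.0])  # "real" distribution
q1 = np.array([0.0, 0.0, 1.0, 0.0])  # two generator distributions,
q2 = np.array([0.0, 0.0, 0.0, 1.0])  # both disjoint from p
print(js(p, q1), js(p, q2))  # both saturate near log(2) ~ 0.693
```

Moving the generator's mass from `q1` to `q2` changes nothing: the divergence is flat, which is the low-gradient region the figure illustrates.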

Non-convergence

A GAN is a game in which your opponent always counteracts your actions. The optimal solution is a Nash equilibrium, which is hard to find. Gradient descent is not necessarily a stable method for finding such an equilibrium. When mode collapse occurs, the training turns into a cat-and-mouse game in which the model never converges. Just another thought: maybe the nature of the game itself makes GANs hard to converge.

Other problems

Non-convergence and mode collapse are often interpreted as an imbalance between the discriminator and the generator: one may overwhelm the other. There were many attempts at addressing the problem, but not much progress was made in the first few years. Some researchers believe that balancing the two is not a feasible or desirable goal, since a good discriminator gives good feedback. However, some progress has been made lately with more dynamic schemes for balancing their training.

GANs are sensitive to hyperparameter optimization. Performance can fluctuate over a short range of hyperparameter values. Prove the code and the model are working first; then be patient in tuning those parameters.

For example, the following figure demonstrates the performance (y-axis) at various learning rates (x-axis) under different cost functions. The large range of performance differences may cloud your judgment on whether your design is working.

If you are interested in developing solutions to improve GAN training, you will need more information than we explain here. There is a follow-up article in our GAN series that explains these problems in more detail (again, for your later reference).

Measurement

GAN objective functions measure the competition between the generator and the discriminator. However, these metrics do not reflect image quality and are not suitable for model comparison, progress monitoring, or performance tuning. In the figure below, the generator cost increases even as the image quality improves.

In early research, model results were compared visually, which is strongly biased. Many "state-of-the-art" claims from early research papers are hard to verify or overstated. To address this, the Inception Score (IS) was developed to measure image quality and diversity. If generated images can be labeled correctly while being evenly distributed among different object classes, they receive a high IS.
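A minimal sketch of the IS computation from class probabilities (toy probabilities stand in for a real Inception network's predictions):

```python
import numpy as np

def inception_score(probs):
    """IS = exp(E_x[ KL(p(y|x) || p(y)) ]), where each row of `probs`
    is p(y|x), the class probabilities for one generated image."""
    p_y = np.mean(probs, axis=0)  # marginal label distribution p(y)
    kl = np.sum(probs * (np.log(probs + 1e-12) - np.log(p_y + 1e-12)), axis=1)
    return float(np.exp(np.mean(kl)))

# Best case: each image is confidently one class, classes evenly covered.
confident_diverse = np.array([[1.0, 0.0, 0.0],
                              [0.0, 1.0, 0.0],
                              [0.0, 0.0, 1.0]])
# Worst case: the classifier is uncertain about every image.
uniform = np.full((3, 3), 1.0 / 3.0)
print(inception_score(confident_diverse))  # ~3 (the number of classes)
print(inception_score(uniform))            # ~1
```

Confident per-image predictions reward quality; an even marginal over classes rewards diversity, so the score captures both criteria at once.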

The Fréchet Inception Distance (FID) measures the statistical difference between features of real and generated images, extracted by an Inception network. A low FID indicates the generated images are natural, with a diversity similar to the real images. To learn more about the definitions of IS and FID, and their weaknesses, see our article on measuring GAN performance.
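A minimal sketch of the FID formula on toy features (real implementations run an Inception network and usually compute the matrix square root with `scipy.linalg.sqrtm`; here its trace is taken via eigenvalues to stay NumPy-only):

```python
import numpy as np

def fid(feat_real, feat_fake):
    """Fréchet distance between Gaussians fitted to two feature sets:
    ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2})."""
    mu1, mu2 = feat_real.mean(0), feat_fake.mean(0)
    c1 = np.cov(feat_real, rowvar=False)
    c2 = np.cov(feat_fake, rowvar=False)
    # Tr((C1 C2)^{1/2}) equals the sum of square roots of the
    # eigenvalues of C1 @ C2 (non-negative for PSD covariances).
    eig = np.linalg.eigvals(c1 @ c2)
    covmean_tr = np.sum(np.sqrt(np.maximum(eig.real, 0.0)))
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1) + np.trace(c2)
                 - 2.0 * covmean_tr)

rng = np.random.RandomState(0)
a = rng.randn(500, 4)           # toy stand-ins for Inception features
print(fid(a, a))                # identical sets -> ~0
print(fid(a, a + 5.0))          # mean shift of 5 in 4 dims -> ~100
```

Because FID compares means and covariances, it penalizes both unnatural images (shifted mean) and collapsed diversity (shrunken covariance), which is why it has largely replaced IS for model comparison.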

Now we know what a GAN is, how to use it, and what is wrong with it. So how do we solve the training problems? In Part 2, we will give you an overview of the solutions.
