GAN — A comprehensive review into the gangsters of GANs (Part 1)
Are we there yet? In this GAN series, we identify a general pattern in how GANs are applied to deep learning problems and look into why GANs are so hard to train. We also check out some potential solutions. By reviewing them in one context, we can understand the motivation and direction of GAN research and the thought process behind it. Hopefully, by the end, we can navigate an immense amount of information with some clarity.
No more horror-movie looks; that is history. By 2017, GANs could create 1024 × 1024 images (the right image above) that may fool a talent scout. Let’s start with the basic idea of generating data x.
Without any guidance, the generator creates only random noise. In Goodfellow’s 2014 GAN paper, a discriminator is added to guide the generator toward imitating real images. If you want a refresher on what a GAN is, this article from our series will help.
While early GAN research focused on image generation, it has since expanded to other areas, such as anime characters and music videos, and to other industries, including medicine. Another popular extension is the cross-domain GAN, which transforms data from domain A (say, text) to domain B (say, images).
In the example above, we add an encoder to extract features of the text, followed by a generator to create images. Real and generated images are fed into a discriminator (in effect, an encoder followed by a sigmoid function) to identify whether each one is real.
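As a minimal sketch of the last step, assume the discriminator's encoder has already reduced each image to a single raw score (a logit). The function names and the example scores below are hypothetical; the sketch only shows how a sigmoid turns scores into probabilities and how a binary cross-entropy cost rewards a discriminator that separates real from generated images.

```python
import math

def sigmoid(logit):
    """Squash a raw discriminator score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def discriminator_loss(real_logits, fake_logits):
    """Binary cross-entropy: real images are labeled 1, generated images 0."""
    real_terms = [-math.log(sigmoid(s)) for s in real_logits]
    fake_terms = [-math.log(1.0 - sigmoid(s)) for s in fake_logits]
    return (sum(real_terms) + sum(fake_terms)) / (len(real_logits) + len(fake_logits))

# Hypothetical encoder outputs for two real and two generated images.
confident = discriminator_loss(real_logits=[3.0, 2.5], fake_logits=[-3.0, -2.0])
confused = discriminator_loss(real_logits=[0.0, 0.0], fake_logits=[0.0, 0.0])
print(confident, confused)
```

A discriminator that scores every image at 0 (probability 0.5) pays a cost of log 2 per image, while one that separates the two groups pays much less.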
One nice demonstration of this cross-domain transfer is CycleGAN, which converts real scenery into a Monet-style painting (or vice versa).
Many domain-transfer GANs add a second pathway to reconstruct the original image. An additional reconstruction cost guides the encoder to extract better features from y, so that the generated images preserve the important features of the original input.
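One common way to write such a reconstruction cost, given here as an illustrative assumption rather than the exact formulation behind the figure, is an L1 distance between the original input and its reconstruction, added to the adversarial cost. The symbols are notation introduced for this sketch: G maps the input y to a generated image, F is the second pathway that maps it back, and λ is a weight that trades off the two costs.

```latex
\mathcal{L}_{\text{rec}} = \mathbb{E}_{y}\big[\, \lVert F(G(y)) - y \rVert_1 \,\big],
\qquad
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{adv}} + \lambda\, \mathcal{L}_{\text{rec}}
```

A larger λ pushes the generated image to stay closer to the content of y; a smaller λ gives the adversarial cost more freedom to change it.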
Cross-domain GANs have many potential commercial applications. We encourage you to look into the CycleGAN article in our series later if you are interested in domain transfer applications.
Another type of GAN uses meta-data in generating images. For example, we create images in different poses or view them from different angles.
In this example, we add the meta-data as an additional input to the encoder to generate images. We also pass the original image (or sometimes the meta-data) as an additional input to the discriminator to distinguish images.
Our last type of GAN does not really need to be a GAN. Everyone can take advice from a critic once in a while. Indeed, the concept of a critic (a discriminator) was formulated in reinforcement learning a while back. Researchers can apply the GAN concept to existing solutions, like object detection, to refine their results. In another example, a deep network encoder and decoder are used to reconstruct higher-resolution images, and the network is trained with the mean squared error (MSE).
To introduce the GAN concept, we use a discriminator to guide the training of the generator.
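A rough sketch of how the two signals might be combined: the existing MSE cost stays, and a weighted adversarial term is added on top. The function names, the toy pixel values, and the weight `adv_weight` are assumptions made for illustration, not values taken from a specific paper.

```python
import math

def mse(predicted, target):
    """Mean squared error between two equally sized lists of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def generator_loss(predicted, target, d_prob_real, adv_weight=1e-3):
    """Content cost (MSE) plus a weighted adversarial term.

    d_prob_real is the discriminator's probability that the generated image
    is real; the generator pays less when that probability is high.
    adv_weight is a hypothetical value chosen for illustration.
    """
    adversarial = -math.log(d_prob_real)
    return mse(predicted, target) + adv_weight * adversarial

# Same reconstruction, judged by the discriminator as convincing vs. not.
pixels, truth = [0.2, 0.5, 0.9], [0.25, 0.5, 0.8]
fooled = generator_loss(pixels, truth, d_prob_real=0.9)
caught = generator_loss(pixels, truth, d_prob_real=0.1)
print(fooled, caught)
```

With identical pixel error, the generator's cost is lower when the discriminator is fooled, which is the extra guidance the critic provides.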
To learn more about this type of application, you can read the super-resolution article from our series. It gives more details on applying the GAN concept to produce human-pleasing super-resolution images.
The following is the general idea in applying GAN to an existing solution.
In this section, we group different GAN applications together to illustrate their similarities, so you don’t feel there are hundreds of types of GAN applications. If you want to see them in action later, visit our GAN applications article to visualize what they do and what their networks may look like.
Training a GAN is not easy. GAN models may suffer from the following problems:
- Mode collapse: the generator produces a limited variety of samples,
- Diminished gradient: the discriminator becomes so successful that the gradients vanish and the generator learns nothing,
- Non-convergence: the model parameters oscillate, destabilize and never converge,
- Imbalance between the generator and the discriminator, which causes overfitting, and
- High sensitivity to hyperparameters.
Mode collapse occurs when the generated images converge to the same image (the same optimal point).
Full mode collapse is not common, but partial collapse happens often. About half of the images below have a similar-looking counterpart.
We have a limited understanding of the full dynamics of mode collapse in practice. Mitigation methods have been proposed to penalize the generator when it happens. However, partial collapse remains common.
Gradient descent relies on the gradients to learn. This behaves like dropping a marble into a bowl. However, if we place a marble carefully on the edge of the bowl below, it may not drop to the bottom.
Optimizing a GAN generator is close to optimizing the JS-divergence. The figure below visualizes the value function of a JS-divergence. There is a low-gradient region that is just like the edge of our bowl. At least in the beginning, when the generated distribution is still far from the real one, the training is very slow.
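The flat region can be reproduced with a toy calculation. The sketch below, which uses made-up point-mass distributions over ten bins, computes the JS-divergence between a "real" distribution and a "generated" one placed at different offsets. Once the two no longer overlap, the divergence stays at log 2 no matter how far apart they are, so it gives no signal about which direction is closer.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def point_mass(position, size=10):
    """A distribution with all of its probability on a single bin."""
    return [1.0 if i == position else 0.0 for i in range(size)]

real = point_mass(0)
for offset in (0, 1, 5, 9):
    print(offset, js(real, point_mass(offset)))
```

The divergence is 0 at offset 0 and log 2 (about 0.693) at offsets 1, 5 and 9 alike. Moving the generated distribution from bin 9 to bin 1 is real progress, yet the JS-divergence does not change.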
A GAN is a game in which your opponent always counteracts your actions. The optimal solution is known as a Nash equilibrium, which is hard to find. Gradient descent is not necessarily a stable method for finding such an equilibrium. When the mode collapses, the training turns into a cat-and-mouse game in which the model never converges. As another thought, the very nature of the game may be what makes GANs hard to converge.
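A standard toy example, not taken from this article, shows why gradient descent can fail here. Two players share the value function V(x, y) = x · y: one player adjusts x to minimize it and the other adjusts y to maximize it. The equilibrium is at (0, 0), but if both players take simultaneous gradient steps, they spiral away from it. The learning rate and starting point below are arbitrary choices.

```python
import math

def simultaneous_gradient_steps(x, y, lr=0.1, steps=100):
    """Play min_x max_y V(x, y) = x * y with simultaneous gradient updates.

    Returns the distance from the equilibrium (0, 0) after every step.
    """
    distances = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x                      # dV/dx = y, dV/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y    # x descends, y ascends
        distances.append(math.hypot(x, y))
    return distances

distances = simultaneous_gradient_steps(x=1.0, y=1.0)
print(distances[0], distances[-1])
```

Each step multiplies the squared distance from the equilibrium by (1 + lr²), so the players drift outward in a widening spiral instead of settling down. A smaller learning rate slows the drift but does not remove it.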
Non-convergence and mode collapse are often interpreted as an imbalance between the discriminator and the generator: one may overwhelm the other. There were many attempts at balancing the two, but not much progress was made in the first few years. Some researchers believe that such a balance is neither a feasible nor a desirable goal, since a good discriminator gives good feedback. However, some progress has been made lately with more dynamic schemes for balancing their training.
GANs are sensitive to hyperparameters. The performance can fluctuate within a narrow range of hyperparameter values. Prove that the code and the model are working first. Then be patient in tuning those parameters.
For example, the following figure shows the performance (y-axis) across various learning rates (x-axis) under different cost functions. The wide range of performance may cloud your judgment on whether your design is working.
If you are interested in developing solutions to improve GAN training, you need more information than what we explain here. A follow-up article in our GAN series explains these problems in more detail (again, for your later reference).
GAN’s objective functions measure the competition between the generator and the discriminator. However, these metrics do not reflect the image quality and are not suitable for model comparison, progress monitoring, or performance tuning. In the figure below, the generator cost increases even as the image quality improves.
In early research, model results were compared visually, which is strongly biased. Many “state-of-the-art” claims from early research papers are hard to verify or are overstated. To address that, the Inception Score (IS) was developed to measure image quality and diversity. If an Inception network can label each generated image with confidence while the generated images are evenly distributed among different object classes, the images receive a high IS.
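The two conditions can be made concrete with a small sketch. The Inception Score is the exponential of the average KL-divergence between each image's class prediction p(y|x) and the overall class distribution p(y). The code below assumes the class probabilities have already been produced by a classifier; the three toy prediction tables are invented for illustration.

```python
import math

def inception_score(class_probs):
    """exp of the average KL divergence between p(y|x) and the marginal p(y).

    class_probs holds one row of class probabilities per generated image.
    """
    n = len(class_probs)
    num_classes = len(class_probs[0])
    marginal = [sum(row[k] for row in class_probs) / n for k in range(num_classes)]
    total_kl = 0.0
    for row in class_probs:
        total_kl += sum(p * math.log(p / marginal[k])
                        for k, p in enumerate(row) if p > 0)
    return math.exp(total_kl / n)

# Toy predictions for three generated images over three classes.
sharp_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collapsed = [[1.0, 0.0, 0.0]] * 3       # confident, but always the same class
blurry = [[1 / 3, 1 / 3, 1 / 3]] * 3    # spread over classes, never confident

for name, probs in [("sharp and diverse", sharp_and_diverse),
                    ("collapsed", collapsed), ("blurry", blurry)]:
    print(name, inception_score(probs))
```

Only the first table scores high (about 3, the number of classes). The collapsed and blurry tables both score about 1, the minimum, because each fails one of the two conditions.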
Fréchet Inception Distance (FID) measures the statistical difference between the features of the real and generated images extracted by an Inception network. A low FID indicates that the generated images look natural and are about as diverse as the real images. To learn more about the definitions of IS and FID, and their weaknesses, we provide another article on measuring GAN performance.
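FID models both sets of features as Gaussians and computes the Fréchet distance between them: the squared distance between the means plus a term that compares the covariances. The sketch below is a deliberate simplification that assumes the feature dimensions are uncorrelated (diagonal covariances), so the covariance term reduces to a sum of squared differences of standard deviations. The full metric uses the complete covariance matrices of Inception features; the tiny feature lists here are made up.

```python
import math

def column_stats(features):
    """Per-dimension mean and (population) variance of a list of vectors."""
    n = len(features)
    dims = range(len(features[0]))
    means = [sum(row[d] for row in features) / n for d in dims]
    variances = [sum((row[d] - means[d]) ** 2 for row in features) / n for d in dims]
    return means, variances

def frechet_distance_diagonal(real_features, generated_features):
    """Frechet distance between two Gaussians, assuming diagonal covariances."""
    mu_r, var_r = column_stats(real_features)
    mu_g, var_g = column_stats(generated_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    spread_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                      for a, b in zip(var_r, var_g))
    return mean_term + spread_term

# Hypothetical 2-D features; "shifted" has the same spread but a moved mean.
real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
shifted = [[v + 1.0 for v in row] for row in real]
print(frechet_distance_diagonal(real, real))
print(frechet_distance_diagonal(real, shifted))
```

Identical feature sets give a distance of 0, and shifting every feature by 1 in both dimensions gives 2, coming entirely from the mean term. A set with the right mean but too little spread would be penalized through the second term instead, which is how the metric picks up a loss of diversity.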
Now we know what a GAN is, how to use it, and what is wrong with it. So how do we solve the training problems in GANs? In part 2, we will give you an overview of the solutions.