A 2017 Google Brain paper, “Are GANs Created Equal?”, claims:
“Finally, we did not find evidence that any of the tested algorithms consistently outperforms the original one.”
The paper argues that we should spend our time on hyperparameter optimization rather than on testing different cost functions. Are cost functions like LSGAN, WGAN-GP or BEGAN dead? Should we stop using them? In this article, we look into the details of the presented data and try to answer this question ourselves.
Let’s look into two major claims quoted from the paper:
- When the budget is limited, any statistically significant comparison of the models is unattainable, and
- Algorithmic differences in state-of-the-art GANs become less relevant, as the computational budget increases.
This is true for the two low-complexity datasets in the experiments. However, on CELEBA, WGAN-GP achieves a better FID score (the lower the better) than the others. On CIFAR10, the cost functions perform differently at every budget level.
According to the paper's figure, WGAN and WGAN-GP have the best FID scores on CIFAR10 and CELEBA respectively. So for complex datasets, cost functions may matter.
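Since the comparison above rests on FID, it may help to see what the score measures. FID fits a Gaussian to the feature vectors of real images and another to those of generated images, then computes the Fréchet distance between the two: ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). The sketch below is a simplified illustration only. It assumes diagonal covariances so that it needs nothing beyond the standard library, and the function name and the toy feature vectors are hypothetical. A real FID uses full covariance matrices over Inception features.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diag(real_feats, fake_feats):
    """Frechet distance between two Gaussians fitted to feature vectors.

    Assumption: covariances are treated as diagonal, so the matrix square
    root in the full FID formula reduces to a per-dimension square root.
    """
    dims = len(real_feats[0])
    total = 0.0
    for d in range(dims):
        real_d = [row[d] for row in real_feats]
        fake_d = [row[d] for row in fake_feats]
        mu_r, mu_f = fmean(real_d), fmean(fake_d)
        var_r, var_f = pvariance(real_d), pvariance(fake_d)
        # Squared mean difference plus the diagonal trace term.
        total += (mu_r - mu_f) ** 2
        total += var_r + var_f - 2.0 * math.sqrt(var_r * var_f)
    return total


# Hypothetical 2-D "features": the fake set is the real set shifted by 1
# along the first dimension, so only the mean term contributes.
real = [[0.0, 0.0], [2.0, 2.0]]
fake = [[1.0, 0.0], [3.0, 2.0]]
print(frechet_distance_diag(real, real))
print(frechet_distance_diag(real, fake))
```

Identical feature sets give a distance of zero, and the distance grows as the generated features drift away from the real ones, which is why a lower FID is better.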
In practice, the datasets used by many practitioners are fairly complex. Many projects have not reached commercial quality yet, so pushing the image quality remains a high priority. Given the presented data, experimenting with different cost functions will likely continue.
Hyperparameter & Computational budget
No single cost function performs consistently better than the others across different datasets. The paper claims that new cost functions are not necessarily more stable or more predictable when training GANs. Because performance varies so widely under different hyperparameters, hyperparameter tuning is particularly important for any cost function, and it will therefore give a better return on investment.
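One way to make the idea of a "budget" concrete is to ask: if we can only afford k training runs with randomly sampled hyperparameters, what is the best FID we should expect to see? The sketch below estimates this by bootstrap resampling from a pool of FID scores. It is an illustration under assumptions, not necessarily the paper's exact procedure: the function name, the number of resamples and the FID values are all hypothetical.

```python
import random


def expected_best_fid(fid_scores, budget, n_resamples=2000, seed=0):
    """Estimate the expected minimum FID when only `budget` runs are affordable.

    Each resample draws `budget` scores with replacement from the pool of
    observed runs and keeps the best (lowest) one.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_resamples):
        sample = rng.choices(fid_scores, k=budget)
        total += min(sample)
    return total / n_resamples


# Hypothetical FID scores from runs with random hyperparameters.
scores = [30.0, 45.0, 50.0, 60.0, 80.0, 120.0]
for k in (1, 3, 20):
    print(k, round(expected_best_fid(scores, k), 1))
```

With a budget of one run, the estimate sits near the average of the pool. As the budget grows, it approaches the best score in the pool, which is why comparisons between cost functions depend so heavily on how much tuning each one received.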
But let’s look into the details of how the paper conducts the tuning. Here are the lists of hyperparameters tuned for each algorithm.
For some cost functions, performance varies widely with hyperparameters such as beta and batch normalization.
But beta is eventually set to 0.5, and there is usually a preferred batch normalization setting for each cost function. Searching over an unnecessarily wide range of hyperparameters may put some cost functions at an unfair disadvantage and overestimate the resources required.
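To see how quickly a wide search inflates the resource requirement, the sketch below counts the configurations in a broad grid and in a grid where beta and batch normalization are already fixed. All values are hypothetical and are not the ranges used in the paper. Only the choice of beta = 0.5 comes from the text above.

```python
from itertools import product


def configurations(space):
    """Enumerate every combination of values in a search space."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


# Hypothetical search spaces, for illustration only.
broad_space = {
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2],
    "beta": [0.0, 0.5, 0.9],
    "batch_norm": [True, False],
}
narrow_space = {
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2],
    "beta": [0.5],         # fixed, as discussed above
    "batch_norm": [True],  # assumed preferred setting for this cost function
}

print(len(configurations(broad_space)), "runs for the broad grid")
print(len(configurations(narrow_space)), "runs for the narrowed grid")
```

Fixing the two settings cuts this toy grid by a factor of six, so the same budget covers the remaining hyperparameters far more densely.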
What is my take?
We have not found a cost function that works consistently better than the others. But for a particular complex dataset, there seems to be one that works better. However, no model performs well with non-optimal hyperparameters, and tuning hyperparameters for GAN models is hard and time-consuming. So if you don’t have enough time, first find the best that your favorite cost function can do. If your performance plateaus or gets stuck, give another cost function a chance. So the real question is not necessarily whether we should try new cost functions. The real question is when.
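The "when" can be made a little more concrete with a simple plateau check on the FID history of your tuning runs. This is a minimal sketch under assumptions: the function name, the `patience` and `min_delta` parameters and the sample histories are hypothetical, and suitable thresholds depend on your dataset.

```python
def has_plateaued(fid_history, patience=5, min_delta=0.5):
    """Return True if the best FID has stopped improving.

    Compares the best score in the last `patience` evaluations with the
    best score before them. An improvement smaller than `min_delta`
    counts as a plateau.
    """
    if len(fid_history) <= patience:
        return False  # not enough history to judge
    best_before = min(fid_history[:-patience])
    best_recent = min(fid_history[-patience:])
    return best_before - best_recent < min_delta


# Hypothetical FID histories (lower is better).
still_improving = [100.0, 80.0, 60.0, 50.0, 45.0, 40.0, 35.0, 30.0]
stuck = [100.0, 60.0, 40.0, 40.2, 39.9, 40.1, 40.0]

print(has_plateaued(still_improving, patience=3))
print(has_plateaued(stuck, patience=3))
```

When the check fires for your favorite cost function after a reasonable amount of tuning, that is a sensible moment to try another one.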
While developing new cost functions, we do notice some trends worth pursuing. Even though we have not yet found a cost function that always blows the others away, the effort will likely continue. In the meantime, we will have to deal with a lot of conflicting information. But as time progresses, we will correct some misconceptions and understand things better. In fact, the Google Brain paper gives us a lot of raw data on how particular hyperparameters affect different cost functions. This information is well worth consulting when we plan our tuning. Eventually, trying out different cost functions will become much easier, even with very limited resources.
Thoughts on F1, precision and recall score
The Google paper also shows that many new cost functions have low F1 and recall scores in its polygon toy experiment. However, the low scores for BEGAN are hard to reconcile with its relatively high FID score on CELEBA. So what these measurements imply for complex datasets remains questionable.
In addition, partial mode collapse (or lower recall) is not necessarily all bad news for many GAN applications if image quality improves.
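To see why a model with partial mode collapse can still look good on image quality while scoring poorly on F1, recall that F1 is the harmonic mean of precision and recall, so a low recall drags it down even when precision is high. The numbers below are hypothetical and are not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical models: one with sharp samples but partial mode collapse,
# one with balanced but mediocre precision and recall.
collapsed = f1_score(precision=0.9, recall=0.1)
balanced = f1_score(precision=0.5, recall=0.5)
print(round(collapsed, 2), round(balanced, 2))
```

The collapsed model ends up with an F1 of about 0.18 against 0.5 for the balanced one, even though its individual samples are more often realistic. Whether that trade-off is acceptable depends on the application.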