In this article, we look into the details of SGAN, which produces some of the highest image quality among GANs. Stacked Generative Adversarial Networks (SGAN) is composed of
- an encoder y = E(x) where x is the image and y is its label, and
- a decoder x = G(y, z) where z is the noise.
The decoder here works as the generator in a GAN model.
As the name “stacked” implies, the encoding and the decoding are done in a stack.
But let’s first focus on a design with a single level. The image x is fed into the encoder E1 to predict the label y. The label is then fed into the generator G1 together with the noise z1 to generate an image. The generated image is forwarded to the encoder E1 to predict the label again.
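The single-level flow above can be sketched in a few lines of NumPy. This is only a toy illustration: the dimensions, the random linear maps `W_enc` and `W_gen`, and the `softmax` helper are assumptions standing in for the real deep networks, not the SGAN architecture itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: 64-pixel "images", 10 labels, 16-dim noise.
IMG_DIM, NUM_LABELS, NOISE_DIM = 64, 10, 16

# Random linear maps stand in for the real deep networks (assumption).
W_enc = rng.normal(size=(IMG_DIM, NUM_LABELS))
W_gen = rng.normal(size=(NUM_LABELS + NOISE_DIM, IMG_DIM))

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def E1(x):
    """Encoder: image -> predicted label distribution y."""
    return softmax(x @ W_enc)

def G1(y, z1):
    """Generator: (label, noise) -> generated image."""
    return np.tanh(np.concatenate([y, z1], axis=1) @ W_gen)

x = rng.normal(size=(4, IMG_DIM))        # batch of "real" images
y_real = E1(x)                           # label predicted from the real image
z1 = rng.uniform(size=(4, NOISE_DIM))    # noise
x_gen = G1(y_real, z1)                   # generated image
y_gen = E1(x_gen)                        # label predicted from the generated image
```

The two predictions `y_real` and `y_gen` are the pair of labels that the losses below compare.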
Now, we have a generated image and 2 predicted labels (one for the real image and the other for the generated image). The cost function to train the generator G1 is composed of three parts:
- Adversarial loss: a discriminator network D1 distinguishes real images from generated ones.
- Conditional loss: makes sure both predicted labels match.
- Entropy loss: a network Q1 computes an entropy loss. It forces the generated image to depend on the noise, i.e. to be G1(h2, z1) instead of G1(h2).
Before going into details, we relabel the variables in the diagram. In a multi-level stack, h denotes the features extracted by the encoder and the features generated by the generator.
The adversarial loss is no different from that of any other GAN. We have seen it many times already, so we will not elaborate further.
We compare the features that the encoder extracts from the real image with those it extracts from the generated image (the blue line below).
We calculate their distance using a function f, say the Euclidean distance. This ensures that the features created by the generator and the features encoded by the encoder stay close to their counterparts.
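As a minimal sketch of this comparison, assume f is the squared Euclidean distance averaged over the batch (one possible choice of f, not necessarily the one used in the original implementation). The function name `conditional_loss` and the toy feature values are hypothetical.

```python
import numpy as np

def conditional_loss(h_real, h_gen):
    """Mean squared Euclidean distance between two batches of features."""
    diff = np.asarray(h_real, dtype=float) - np.asarray(h_gen, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Toy features: the first sample matches exactly, the second is off by 1 in one dimension.
h_real = np.array([[1.0, 0.0], [0.0, 1.0]])
h_gen = np.array([[1.0, 0.0], [1.0, 1.0]])
loss = conditional_loss(h_real, h_gen)
```

The loss is zero only when the features recovered from the generated image match the features that conditioned the generator.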
The conditional loss above degrades image diversity. It encourages G to create images using G1(h2) instead of G1(h2, z1), because G can reduce the conditional loss by simply ignoring the noise.
We create another network Q, which shares all its layers with D except the last dense output layer, to estimate the posterior P(z1 | h1), where P is the probability of observing the noise z1 given the feature h1. Here is the entropy loss we add to train the generator:
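A sketch of that loss, assuming the variational form used in the SGAN paper, where Q1 approximates the true posterior and the hatted symbol (an assumed notation here) denotes the features produced by the generator:

```latex
\mathcal{L}^{ent}_{G_1}
  = \mathbb{E}_{z_1 \sim P_{z_1}}
    \Big[ \mathbb{E}_{\hat{h}_1 \sim G_1(\hat{h}_1 \mid z_1)}
      \big[ -\log Q_1(z_1 \mid \hat{h}_1) \big] \Big]
```

Minimizing this term rewards the generator when Q1 can recover z1 from the generated features.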
This penalizes the network if z1 is not related to the latent features h1, and it forces h1 = G1(h2, z1).
Here is the pseudocode. Note that it uses MSE rather than cross-entropy to compute the loss.
z0 = theano_rng.uniform(size=(args.batch_size, 16)) # uniform noise...
disc0_layer_z_recon = LL.DenseLayer(disc0_layer_shared, ...) # Q head on top of the layers shared with D
..., recon_z0 = LL.get_output([disc0_layer_z_recon ...)
loss_gen0_ent = T.mean((recon_z0 - z0)**2) # MSE between the reconstructed and the sampled noise
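To see why this MSE term discourages ignoring the noise, here is a small self-contained NumPy sketch. It does not reproduce the Theano graph above: the two "Q outputs" are hypothetical stand-ins, one that recovers the noise almost perfectly and one that can only guess a constant because the generated features carry no information about z0.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, noise_dim = 8, 16

z0 = rng.uniform(size=(batch_size, noise_dim))   # sampled uniform noise

def entropy_loss_mse(recon_z, z):
    """MSE between the noise Q reconstructs and the noise that was sampled."""
    return float(np.mean((recon_z - z) ** 2))

# Hypothetical Q outputs (assumptions for illustration only).
recon_good = z0 + 0.01 * rng.normal(size=z0.shape)   # features encode z0 well
recon_ignored = np.full_like(z0, 0.5)                # constant guess: mean of U(0, 1)

loss_good = entropy_loss_mse(recon_good, z0)
loss_ignored = entropy_loss_mse(recon_ignored, z0)
```

When the generator ignores the noise, the best Q can do is the constant guess, so the loss stays near the variance of the noise. Minimizing MSE corresponds to minimizing the negative log-likelihood under a Gaussian Q with fixed variance, which is one way to read the substitution for cross-entropy.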
Train the encoder
Unlike many other GAN models, SGAN uses a training dataset that contains both images and labels. Training the encoder is the same as training a classifier with supervised learning.
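As a minimal sketch of that supervised step, the following trains a one-layer softmax classifier with plain gradient descent on random toy data. The data, the sizes, the learning rate, and the single weight matrix `W` are all assumptions; a real encoder would be a deep network trained on the actual image dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, img_dim, num_labels = 32, 20, 3

# Toy labeled dataset standing in for (image, label) pairs.
x = rng.normal(size=(num_samples, img_dim))
y = rng.integers(0, num_labels, size=num_samples)

W = np.zeros((img_dim, num_labels))   # weights of a one-layer "encoder"

def cross_entropy(W, x, y):
    """Return the mean cross-entropy loss and the predicted probabilities."""
    logits = x @ W
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean(), np.exp(log_probs)

loss_before, _ = cross_entropy(W, x, y)
for _ in range(100):
    _, probs = cross_entropy(W, x, y)
    probs[np.arange(num_samples), y] -= 1.0        # gradient w.r.t. the logits (before averaging)
    W -= 0.1 * (x.T @ probs) / num_samples         # gradient-descent step
loss_after, _ = cross_entropy(W, x, y)
```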
The remaining training process consists of
- Training each level of the stack individually
- Joint training
On the left side of the figure below, we train each individual level separately and independently (from phase 1 to phase 3). Then we train all levels jointly (the right side).
Training individual levels of the stack
Next, we train each level of the stack individually.
For the 3-level stack below, we run three separate and independent training passes, from the bottom to the top. We first train E0 and G0. Once that is done, we train E1 and G1. Finally, we train E2 and G2.
Finally, we train all levels jointly.
However, during joint training the generator does not take the output of the encoder as its input. It uses the output of the generator at the level above instead:
Here is the flow to create an image of label y.
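A minimal sketch of this top-down flow, assuming a 3-level stack in which G2 maps (y, z2) to h2, G1 maps (h2, z1) to h1, and G0 maps (h1, z0) to the image x. The feature sizes and the `make_generator` helper are hypothetical, and the toy generators are random linear maps rather than trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LABELS, NOISE_DIM = 10, 16
FEATURE_DIMS = [64, 32, 16]   # hypothetical sizes of x (= h0), h1, h2

def make_generator(in_dim, out_dim):
    """Toy generator: a random linear map plus tanh, standing in for a deep net."""
    W = rng.normal(size=(in_dim + NOISE_DIM, out_dim)) / np.sqrt(in_dim + NOISE_DIM)
    return lambda h_above, z: np.tanh(np.concatenate([h_above, z], axis=1) @ W)

G2 = make_generator(NUM_LABELS, FEATURE_DIMS[2])
G1 = make_generator(FEATURE_DIMS[2], FEATURE_DIMS[1])
G0 = make_generator(FEATURE_DIMS[1], FEATURE_DIMS[0])

def sample_image(label, batch_size=1):
    """Top-down pass: each generator consumes the output of the generator above."""
    h = np.zeros((batch_size, NUM_LABELS))
    h[:, label] = 1.0                                   # one-hot label y
    for G in (G2, G1, G0):
        z = rng.uniform(size=(batch_size, NOISE_DIM))   # fresh noise at each level
        h = G(h, z)
    return h

x_generated = sample_image(label=3, batch_size=5)
```

Because fresh noise is injected at every level, two calls with the same label produce different images, which is the diversity the entropy loss is meant to preserve.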
Finally, this diagram summarizes the whole flow for your reference.
Training GANs is hard. By splitting the training across multiple levels, we can use divide-and-conquer to achieve higher image quality than with a single level.