Photo by Stefan Cosma

GAN — Self-Attention Generative Adversarial Networks (SAGAN)

How can GAN use attention to improve image quality, like how attention improves accuracy in language translation and image captioning? For example, an image captioning deep network focuses on different areas of the image to generate words in the caption.


For GAN models trained with ImageNet, they are good at classes with a lot of texture (landscape, sky) but perform much worse for structure. For example, GAN may render the fur of a dog nicely but fail badly for the dog’s legs. While convolutional filters are good at exploring spatial locality information, the receptive fields may not be large enough to cover larger structures. We can increase the filter size or the depth of the deep network but this will make GANs even harder to train.



For each convolutional layer,

  • Compute the self-attention output.
Modified from source
Visualization of the attention map for the location marked by the red dot. source

Loss function

SAGAN uses hinge loss to train the network:


Self-attention does not apply to the generator only. Both the generator and the discriminator use the self-attention mechanism. To improve the training, different learning rates are used for the discriminator and the generator (called TTUR in the paper). In addition, spectral normalization (SN) is used to stabilize the GAN training. Here is the performance measure in FID (the lower the better).


Further readings


Self-Attention Generative Adversarial Networks

Deep Learning