TensorFlow Generative Model Examples

Jonathan Hui
Feb 9, 2021

In this article, we cover the following TensorFlow generative model examples:

  • DCGAN
  • CycleGAN
  • Pix2Pix
  • Neural style transfer
  • Adversarial FGSM
  • Autoencoder
  • Autoencoder Denoising
  • Autoencoder Anomaly detection
  • Convolutional Variational Autoencoder (VAE)
  • DeepDream

DCGAN

DCGAN is one of the most popular GAN designs. It is composed of convolutional and transposed convolutional layers, without max pooling or fully connected layers. The figure below is the network design for the generator. This example is trained to generate MNIST digits. We start the article with this most basic GAN model.

Source

Dataset setup for MNIST samples:

Here is the generator, which is fairly self-explanatory. We will not add many comments on the DCGAN code because it is quite standard in deep learning.
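As a sketch of what this generator looks like (layer sizes follow the official DCGAN tutorial; treat the exact numbers as illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_generator_model():
    # Project a 100-d noise vector to a 7x7x256 tensor, then upsample with
    # transposed convolutions until we reach a 28x28x1 MNIST-sized image.
    model = tf.keras.Sequential([
        layers.Dense(7 * 7 * 256, use_bias=False, input_shape=(100,)),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Reshape((7, 7, 256)),

        layers.Conv2DTranspose(128, 5, strides=1, padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(),

        layers.Conv2DTranspose(64, 5, strides=2, padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(),

        # tanh keeps pixel values in [-1, 1], matching the normalized MNIST images.
        layers.Conv2DTranspose(1, 5, strides=2, padding='same', use_bias=False,
                               activation='tanh'),
    ])
    return model
```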

Discriminator:

Here are the loss functions. For the discriminator, we expect real images to be labeled as 1 and generated images as 0. For the generator loss, we want the discriminator to classify the generated images as 1.
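A sketch of these loss functions, using binary cross-entropy on the discriminator logits:

```python
import tensorflow as tf

cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_output, fake_output):
    # Real images should be labeled 1, generated images 0.
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss

def generator_loss(fake_output):
    # The generator wants the discriminator to label its images as 1.
    return cross_entropy(tf.ones_like(fake_output), fake_output)
```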

The optimizer and the checkpoint:

Here is the training step:
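A sketch of this step with two gradient tapes, assuming the generator, discriminator, their optimizers, and the loss functions above are already defined:

```python
import tensorflow as tf

NOISE_DIM = 100

@tf.function
def train_step(images):
    noise = tf.random.normal([tf.shape(images)[0], NOISE_DIM])

    # Record both forward passes so each network gets its own gradients.
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        generated_images = generator(noise, training=True)

        real_output = discriminator(images, training=True)
        fake_output = discriminator(generated_images, training=True)

        gen_loss = generator_loss(fake_output)
        disc_loss = discriminator_loss(real_output, fake_output)

    gen_grads = gen_tape.gradient(gen_loss, generator.trainable_variables)
    disc_grads = disc_tape.gradient(disc_loss, discriminator.trainable_variables)

    generator_optimizer.apply_gradients(zip(gen_grads, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(disc_grads, discriminator.trainable_variables))
```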

The training loop:

Next, we start the model training. Earlier, we randomly sampled sixteen z values. At the end of each training epoch, we generate 16 images for these 16 z values. We save these images, and once training is done, we stitch them together into an animated GIF. So for each sampled z, we can see how the generated image changes as training progresses.

Here is the animation showing the training progress.

CycleGAN

CycleGAN applies a deep network G to transform one type of image into another, say from photographs to Van Gogh-like paintings. To train this model, we also train a discriminator D to distinguish real Van Gogh paintings from the generated ones. In addition, we train another deep network F to restore the original image. The whole model is trained to reduce the generator loss, the discriminator loss, and the reconstruction loss. As a result, G transforms one type of image into another so well that even the discriminator D has a hard time telling it apart from the real ones.

Here are some applications that convert one type of picture to another.

Source

Here are the general setup and sample preparation. In this example, we transform horse pictures into zebras.

In this article, we will not detail the boilerplate code that is common in TensorFlow.

Image preparation and data augmentation:

Prepare the datasets:

Here are the generator and discriminator models from tensorflow_examples.models.

Here are the loss functions. They consist of the typical generator and discriminator losses, plus two more. The cycle loss is a kind of reconstruction loss: “real x” below is a real horse image and “real y” is a real zebra image. If a real horse passes through G and then F, we should get the horse image back. If we pass a real zebra through F and then G, we should get the zebra back. What is the identity loss? G transforms a horse into a zebra, but if we feed a zebra into G, it should still handle it properly and output a zebra. The identity loss ensures that.
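A sketch of the cycle-consistency and identity losses (LAMBDA is the weighting used in the tutorial; "cycled_image" is the result of passing a real image through both generators):

```python
import tensorflow as tf

LAMBDA = 10  # relative weight of the cycle and identity terms

def calc_cycle_loss(real_image, cycled_image):
    # A horse sent through G then F (or a zebra through F then G) should
    # come back close to the original image.
    loss = tf.reduce_mean(tf.abs(real_image - cycled_image))
    return LAMBDA * loss

def identity_loss(real_image, same_image):
    # Feeding a real zebra into G (horse-to-zebra) should leave it unchanged.
    loss = tf.reduce_mean(tf.abs(real_image - same_image))
    return LAMBDA * 0.5 * loss
```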

Optimizer and checkpoint code:

We will train the model for 40 epochs. Here is the code for generating a zebra image from a horse picture and plotting both out.

The first part of the training step is to generate images for different scenarios and compute the associated loss.

The second part applies gradient descent.

Training loop and image generation:

Here are some possible results, but be prepared to train the model much longer.

Source

Pix2Pix

Pix2Pix generates an image given a conditioning image x. This conditioning image guides what image is generated.

Source

Like other GANs, it trains a discriminator to differentiate the real image from the generated one, with the conditioning image as an additional input.

Source

Here are some other potential applications.

Source

Datafile setup:

Here are the functions for loading, manipulating, and augmenting images.

Next, we prepare the datasets.

Here are the downsampling and upsampling layers, built with convolution and transposed convolution.

This is the generator. Skip connections link each downsampling layer to the upsampling layer at the same spatial resolution.

Here are some of the skip connections:

Source

The generator loss:
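A sketch of this loss: an adversarial term plus an L1 term that keeps the output close to the ground-truth target (LAMBDA = 100 as in the tutorial):

```python
import tensorflow as tf

loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100

def generator_loss(disc_generated_output, gen_output, target):
    # Adversarial part: fool the discriminator into outputting 1.
    gan_loss = loss_object(tf.ones_like(disc_generated_output),
                           disc_generated_output)
    # L1 part: stay close to the ground-truth target image.
    l1_loss = tf.reduce_mean(tf.abs(target - gen_output))
    total_gen_loss = gan_loss + LAMBDA * l1_loss
    return total_gen_loss, gan_loss, l1_loss
```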

Here are the discriminator and the loss function.

Optimizer, checkpoint, and a utility function for generating images from the model given conditioning test images:

generate_images also plots out a sample, its ground truth, and the model prediction.

Training step:

Training and image generation:

Here are some generated images:

Source

Neural style transfer

“Neural style transfer” applies the style of one image to another image while preserving the content of the latter.

Source

In this example, we use VGG19 to extract content features from a target content image and style features from a target style image. Then, we use the content image as the starting source. We extract the source's content and style features and compare them with the target content and target style features. We compute the corresponding MSE (mean square error) and use the loss gradient to push the source image towards the target style while keeping its content features close to the original.

Setup:

Here is the utility code for loading and displaying images.

We can download a pre-trained TF Hub model for the style transfer.

Here is a real dog picture transformed with a painting style.

Next, we will redo the exercise. This time, we use a headless VGG19 model (without the classification head) to extract features and write the code to perform the style transfer ourselves. The following diagram is the model summary.

We use features closer to the input layers to capture the style (color, stroke style, etc …) and features closer to the output for the content information.

The style of an image is characterized by a Gram matrix that measures the relationships between features. For example, how certain strokes may relate to certain colors.
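A sketch of the Gram matrix computation, where the feature correlations are averaged over all spatial locations:

```python
import tensorflow as tf

def gram_matrix(input_tensor):
    # Correlate every pair of feature channels over the spatial dimensions.
    result = tf.linalg.einsum('bijc,bijd->bcd', input_tensor, input_tensor)
    input_shape = tf.shape(input_tensor)
    num_locations = tf.cast(input_shape[1] * input_shape[2], tf.float32)
    # Normalize by the number of spatial locations.
    return result / num_locations
```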

Finally, we build a model to extract the content and style features of an image.

Next, we extract the style features from the style image and the content features from the content image, and we start with the content image as the source for the neural style transfer. We also have a clip function to make sure the generated pixel values stay between 0 and 1.

Our total loss is a weighted MSE between the style and content features of the newly generated image and the target style and target content features, respectively.
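A sketch of this weighted loss, assuming "outputs", "style_targets", and "content_targets" are dictionaries of style (Gram matrix) and content features keyed by layer name, and that the two weights are hyperparameters:

```python
import tensorflow as tf

style_weight = 1e-2
content_weight = 1e4

def style_content_loss(outputs, style_targets, content_targets):
    style_outputs = outputs['style']
    content_outputs = outputs['content']

    # MSE between the Gram matrices of the generated image and the style targets.
    style_loss = tf.add_n([
        tf.reduce_mean((style_outputs[name] - style_targets[name]) ** 2)
        for name in style_outputs.keys()
    ]) * style_weight / len(style_outputs)

    # MSE between the content features of the generated image and the content targets.
    content_loss = tf.add_n([
        tf.reduce_mean((content_outputs[name] - content_targets[name]) ** 2)
        for name in content_outputs.keys()
    ]) * content_weight / len(content_outputs)

    return style_loss + content_loss
```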

With gradient descent, training pushes the generated image towards reducing the style and content losses.

But the generated image is noisy. So we add a variation loss to reduce noise. First, we shift the image to the top left by one pixel. Then, we compute the absolute difference with the original. The difference highlights high-frequency areas. We sum up these high-pass signals so that training pushes the generated image towards a smoother result.
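A sketch of this high-pass computation and the resulting variation loss:

```python
import tensorflow as tf

def high_pass_x_y(image):
    # Differences between neighboring pixels approximate the horizontal and
    # vertical high-frequency content of the image.
    x_deltas = image[:, :, 1:, :] - image[:, :, :-1, :]
    y_deltas = image[:, 1:, :, :] - image[:, :-1, :, :]
    return x_deltas, y_deltas

def total_variation_loss(image):
    x_deltas, y_deltas = high_pass_x_y(image)
    return tf.reduce_sum(tf.abs(x_deltas)) + tf.reduce_sum(tf.abs(y_deltas))
```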

Here is the training loop with the additional variation loss. Here we use the built-in tf.image.total_variation instead of the custom method above.

Here is an example of the style transferred images.

Source

Adversarial FGSM

By wearing specially tailored eyeglasses, the people in the top row are all misidentified as the well-known people in the second row.

Source

The question is whether we can design noise (the middle column) small enough that it has no visual impact on the source image yet causes the resulting image (the right column) to be misclassified.

Source

This technique applies pixel-wise changes along the direction of maximum loss gradient. As a result, the logit score for the ground-truth class drops significantly with little or no visual change. In this example, we will use a pre-trained MobileNet V2 as the classifier.
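A sketch of the FGSM perturbation: take the gradient of the loss with respect to the input pixels and keep only its sign:

```python
import tensorflow as tf

loss_object = tf.keras.losses.CategoricalCrossentropy()

def create_adversarial_pattern(model, input_image, input_label):
    with tf.GradientTape() as tape:
        # Watch the image itself: we need d(loss)/d(pixels), not d(loss)/d(weights).
        tape.watch(input_image)
        prediction = model(input_image)
        loss = loss_object(input_label, prediction)
    gradient = tape.gradient(loss, input_image)
    # Only the sign of the gradient is used; epsilon scales the perturbation.
    return tf.sign(gradient)

# Example: adv_image = input_image + eps * create_adversarial_pattern(model, input_image, label)
```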

Prepare a Labrador image:

With one single gradient step, we can already perturb the image enough for it to be misclassified. In this example, we try different values of ε (the higher the value, the more visual distortion in the image).

Source

Autoencoder

An autoencoder uses an encoder to extract latent features z and a decoder to reconstruct the image from them.

Data setup:

Here are the autoencoder and the training.
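A minimal sketch of such an autoencoder, assuming x_train and x_test hold 28x28 images normalized to [0, 1]:

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 64

class Autoencoder(tf.keras.Model):
    def __init__(self, latent_dim):
        super().__init__()
        # Compress the 28x28 image into a small latent vector z ...
        self.encoder = tf.keras.Sequential([
            layers.Flatten(),
            layers.Dense(latent_dim, activation='relu'),
        ])
        # ... and reconstruct the 28x28 image from z.
        self.decoder = tf.keras.Sequential([
            layers.Dense(784, activation='sigmoid'),
            layers.Reshape((28, 28)),
        ])

    def call(self, x):
        return self.decoder(self.encoder(x))

autoencoder = Autoencoder(latent_dim)
autoencoder.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError())
# autoencoder.fit(x_train, x_train, epochs=10, shuffle=True,
#                 validation_data=(x_test, x_test))
```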

Original and reconstructed images:

Source

Autoencoder — Denoising

We can use an autoencoder to denoise images.

Source

First, we create noisy images from the training images. These noisy images will be used as the source images during training.
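A sketch of adding Gaussian noise, assuming x_train and x_test already hold images normalized to [0, 1]:

```python
import tensorflow as tf

noise_factor = 0.2

# Add Gaussian noise, then clip back to the valid [0, 1] pixel range.
x_train_noisy = x_train + noise_factor * tf.random.normal(shape=x_train.shape)
x_test_noisy = x_test + noise_factor * tf.random.normal(shape=x_test.shape)

x_train_noisy = tf.clip_by_value(x_train_noisy, clip_value_min=0., clip_value_max=1.)
x_test_noisy = tf.clip_by_value(x_test_noisy, clip_value_min=0., clip_value_max=1.)
```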

The idea is simple. Our source images are the noisy images and the target images are the original images. We train an autoencoder to convert noisy images to clean images. The autoencoder is trained to extract features without noise.

Autoencoder — Anomaly detection

ECG records the electrical activity generated by the heart.

The dataset is a CSV file containing normal and anomalous ECG samples. First, we read the samples and prepare them.

The AnomalyDetector model is a simple autoencoder. We train it on normal ECGs only, expecting it to reconstruct normal ECGs nicely. We use the MAE (mean absolute error) as the loss function.

But when we reconstruct anomalous ECGs, we expect a much higher MAE reconstruction loss. We do not expect the autoencoder to have learned features that are present in anomalous ECGs but not in normal ECGs, so it will have a harder time reconstructing them.

Source

We compute the MAE reconstruction losses for all the normal ECGs and take their mean and standard deviation; the threshold is one standard deviation above this mean. In the example below, we compute the reconstruction loss for each anomalous ECG. When the error is higher than the threshold, we flag the sample as anomalous.
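A sketch of the threshold and the detection rule, assuming "autoencoder" is the trained AnomalyDetector and "normal_train_data" holds the normal training ECGs (the helper below is illustrative):

```python
import numpy as np
import tensorflow as tf

# Reconstruction errors on the normal training data define the threshold.
reconstructions = autoencoder.predict(normal_train_data)
train_loss = tf.keras.losses.mae(reconstructions, normal_train_data)
threshold = np.mean(train_loss) + np.std(train_loss)

def is_anomalous(model, data, threshold):
    # Flag samples whose reconstruction error exceeds the threshold.
    reconstructions = model(data)
    loss = tf.keras.losses.mae(reconstructions, data)
    return tf.math.greater(loss, threshold)
```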

In this example, we also print out the accuracy, precision, and recall for the anomalous ECG samples.

Convolutional Variational Autoencoder (VAE)

This example trains a Variational Autoencoder (VAE) to generate MNIST digits. In contrast to a plain autoencoder, a VAE outputs the parameters of a probability distribution over the latent features instead of the latent features themselves. Modeling a probability distribution, instead of a point estimate, may capture and reason about uncertainty in real-life problems better.

If we assume the latent feature z follows a multivariate Gaussian with a diagonal covariance matrix, the encoder predicts the mean and the variance for each component of z. To reconstruct the image, we sample z from the predicted distribution. For example,
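with the predicted mean μ(x) and standard deviation σ(x), the sampling step is:

$$z \sim \mathcal{N}\!\big(\mu(x),\ \operatorname{diag}(\sigma^2(x))\big)$$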

And then, we decode z with the decoder.

Reparameterization trick

So how can we perform backpropagation when one of the operations is sampling? We can formulate the objective as an expectation and differentiate it.
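One standard way to write the gradient of such an expectation is the score-function (log-derivative) form:

$$\nabla_\theta\, \mathbb{E}_{p_\theta(z)}\big[f(z)\big] \;=\; \mathbb{E}_{p_\theta(z)}\big[f(z)\, \nabla_\theta \log p_\theta(z)\big]$$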

The calculation involves the gradient of the log probability term above. Unfortunately, many distributions are not differentiable, or the gradient is hard to compute. To address this, we apply a technique called the reparameterization trick.

It simply moves the sampling operation out of the backpropagation path of the trainable parameters. It changes the original sampling operation into x = μ + σε, where ε is sampled from Ɲ(0, 1). Now, the equation for x is simple and easy to differentiate, and since ε is sampled outside the backpropagation path of μ and σ, we do not need to compute its gradient.
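A sketch of the reparameterization in TensorFlow, with the encoder predicting the mean and log-variance:

```python
import tensorflow as tf

def reparameterize(mean, logvar):
    # Sample epsilon outside the differentiable path, then shift and scale it.
    eps = tf.random.normal(shape=tf.shape(mean))
    return eps * tf.exp(logvar * 0.5) + mean
```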

VAE Loss function ELBO (optional)

So what is the loss function for VAE? Mathematically, the training objective can be summarized as minimizing the KL divergence between the encoder distribution q(z|x) and the true (but intractable) posterior p(z|x):
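In standard notation, with the encoder q parameterized by 𝜙:

$$\min_{\phi}\ D_{KL}\big(q_\phi(z|x)\,\big\|\,p(z|x)\big)$$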

This sounds very hard, so let’s explain it one step at a time and derive the VAE loss function accordingly. KL Divergence measures the difference between two distributions and it is defined as:
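For two distributions q and p:

$$D_{KL}\big(q(x)\,\big\|\,p(x)\big) \;=\; \mathbb{E}_{q(x)}\!\left[\log\frac{q(x)}{p(x)}\right] \;=\; \int q(x)\,\log\frac{q(x)}{p(x)}\,dx$$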

Let's replace (x) with (z|x), so the quantity of interest becomes the KL divergence between q(z|x) and p(z|x).

Encoder q is parameterized by 𝜙 and predicts the distribution of z given x.

After applying Bayes’ Theorem

and some simple rearranging, the KL divergence can be rewritten as (proof):
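In standard notation, the rearranged identity reads:

$$\log p(x)\;-\;D_{KL}\big(q_\phi(z|x)\,\big\|\,p(z|x)\big)\;=\;\mathbb{E}_{q_\phi(z|x)}\big[\log p(x|z)\big]\;-\;D_{KL}\big(q_\phi(z|x)\,\big\|\,p(z)\big)$$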

Let's name the last two terms ELBO. Because the KL divergence is always greater than or equal to 0, the log evidence log p(x) is always greater than or equal to ELBO.

That is why it is called the evidence lower bound (ELBO). Since log p(x) is constant w.r.t. 𝜙, the KL divergence and ELBO add up to a constant: the KL divergence drops when ELBO increases. Minimizing the KL divergence is therefore the same as maximizing ELBO.

But how can we compute the ground truth p(z) and p(x|z)? We introduce the latent factor z and we have a great deal of freedom in making further assumptions or constraints. For example, we can assume z has a standard normal distribution Ɲ(0, 1). So the KL term can be computed analytically. For instance,
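For a diagonal Gaussian encoder output measured against the standard normal prior, the closed form is:

$$D_{KL}\Big(\mathcal{N}\big(\mu,\operatorname{diag}(\sigma^2)\big)\,\Big\|\,\mathcal{N}(0,I)\Big)\;=\;\frac{1}{2}\sum_{i}\Big(\mu_i^2+\sigma_i^2-\log\sigma_i^2-1\Big)$$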

To estimate p(x|z), we model it with a network p parameterized by θ.

So the VAE objective becomes:
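That is, we maximize the ELBO over the encoder parameters 𝜙 and decoder parameters θ:

$$\max_{\theta,\phi}\ \ \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]\;-\;D_{KL}\big(q_\phi(z|x)\,\big\|\,p(z)\big)$$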

The first term in ELBO is a generation loss that measures how well we reconstruct the image. The second term is a latent loss that penalizes the model whenever the encoder distribution q(z|x) deviates from the prior p(z).

The latent loss therefore keeps the z distribution from the encoder close to our prior belief p(z).

In particular, our decoder generates pixel values, not probabilities. But we do not build a new model to estimate p(x|z). Instead, we repurpose the decoder for this. In our example, the image is binary (black and white). We can feed the logit outputs of the decoder to a sigmoid function to generate probability values. Then, we treat the problem as binary classification, and the generation loss is simply the cross-entropy loss.

How can we use the decoder in general for this purpose? The objective of the generation loss is to maximize the log-likelihood of x given z. We want to optimize an encoder such that the extracted latent factors are adequate to rebuild x. But we can expand the scope a little more and rephrase this objective as minimizing a reconstruction loss. As an image passes through the encoder and the decoder, we optimize both to have the smallest reconstruction loss. Therefore, in some model trainings, the generation loss is replaced by the MSE of the reconstructed images instead.

Coding

Dataset setup:

This is the first part of the VAE model. It contains an encoder and a decoder built with convolution and transposed convolution.

This is the rest of the VAE class. “sample” samples latent features from a standard normal distribution and decodes them. “encode” predicts the means and the log variances of the features z from input x. “reparameterize” samples values using the reparameterization trick.

We will use the ELBO discussed before as the objective function.

While we provided the theoretical equations for ELBO, here is what we do in practice: a Monte Carlo estimate (sample z and compute the expected values). We sample z from the distribution predicted by the encoder and compute the ELBO value below.

  • p(x|z): the logit prediction of the decoder will pass through a sigmoid function to predict this probability.
  • p(z): p(z) is assumed to have a standard normal distribution. Given a sampled z, we find the corresponding pdf value for z under this distribution.
  • q(z|x): Given the distribution predicted by the encoder, we compute the corresponding pdf value for the sampled z.

log_normal_pdf computes the log probability for “sample” that belongs to a normal distribution with mean and logvar. Then, compute_loss computes the ELBO.
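A sketch of these two functions, following the structure of the official CVAE tutorial (assuming the model exposes encode, reparameterize, and decode):

```python
import numpy as np
import tensorflow as tf

def log_normal_pdf(sample, mean, logvar, raxis=1):
    # Log density of a diagonal Gaussian, summed over the latent dimensions.
    log2pi = tf.math.log(2. * np.pi)
    return tf.reduce_sum(
        -.5 * ((sample - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi),
        axis=raxis)

def compute_loss(model, x):
    mean, logvar = model.encode(x)
    z = model.reparameterize(mean, logvar)
    x_logit = model.decode(z)
    # log p(x|z): per-pixel Bernoulli cross-entropy on the decoder logits.
    cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=x)
    logpx_z = -tf.reduce_sum(cross_ent, axis=[1, 2, 3])
    # log p(z) under the standard-normal prior and log q(z|x) under the encoder.
    logpz = log_normal_pdf(z, 0., 0.)
    logqz_x = log_normal_pdf(z, mean, logvar)
    # Negative single-sample Monte Carlo estimate of the ELBO.
    return -tf.reduce_mean(logpx_z + logpz - logqz_x)
```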

Here is the training step. Our latent feature z is configured to have a dimension of 2. We also generate 16 random testing samples on z to be used later.

generate_and_save_images takes in an image, encodes it, and decodes it. We save the reconstructed images such that we can accumulate them and replay them later as an animated GIF. We also pick 16 testing images for later use.

We train the model and for each epoch, we save 16 reconstructed images to visualize the training.

After the training is done, we stitch all the reconstructed images together as a GIF file to show the training progress.

The latent factor z in this example has 2 components. We change them gradually and plot out the corresponding generated images. This technique gives us a big picture of what these latent factors are controlling.

DeepDream

DeepDream is about exaggerating features in an image. The objective is to apply gradient ascent to the activations of selected layers, changing the image to boost the features that are already highly activated there. Higher activation triggers a higher boost. The feature extraction is similar to neural style transfer: if we choose style layer(s) to boost, we exaggerate the style; if we choose content layer(s), we boost the content structure (like the eyes).

Image setup:

We use Inception V3 for extraction and pre-select two layers to boost its activation. We compute the loss based on the activation.
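A sketch of this loss: the sum of the mean activations of the selected layers (assuming "model" returns those layers' activations):

```python
import tensorflow as tf

def calc_loss(img, model):
    # Forward the image through the feature extractor, which returns the
    # activations of the pre-selected layers.
    img_batch = tf.expand_dims(img, axis=0)
    layer_activations = model(img_batch)
    if len(layer_activations) == 1:
        layer_activations = [layer_activations]

    # The "loss" to maximize is simply the mean activation of each layer.
    losses = [tf.reduce_mean(act) for act in layer_activations]
    return tf.reduce_sum(losses)
```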

Our training applies gradient ascent to exaggerate the selected layers' activations.

Here is the training.

To improve quality and resolution, we gradually upsample the image, and at each resolution we apply DeepDream again before upsampling further.

But applying the transformation to a large upsampled image can be expensive. Alternatively, we can upsample an image, divide it into tiles, and apply the transformation to each tile separately. To avoid edge effects between the tiles, we roll the image randomly before breaking it into tiles and unroll it back afterwards.
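A sketch of the random roll, using tf.roll to shift the image before tiling and to undo the shift afterwards:

```python
import tensorflow as tf

def random_roll(img, maxroll):
    # Randomly shift the image so tile seams fall in different places each step.
    shift = tf.random.uniform(shape=[2], minval=-maxroll, maxval=maxroll,
                              dtype=tf.int32)
    img_rolled = tf.roll(img, shift=shift, axis=[0, 1])
    return shift, img_rolled

# Later, undo the shift with tf.roll(img_rolled, shift=-shift, axis=[0, 1]).
```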

Here is the model that performs the tiled training.

And this is the training loop.

Credits and References

All the source code originates from, or is modified from, the TensorFlow tutorials.
