
TensorFlow Generative Model Examples

Feb 9, 2021 · 18 min read

In this article, we cover the following TensorFlow generative model examples:

  • DCGAN
  • CycleGAN
  • Pix2Pix
  • Neural style transfer
  • Adversarial FGSM
  • Autoencoder
  • Autoencoder Denoising
  • Autoencoder Anomaly detection
  • Convolutional Variational Autoencoder (VAE)
  • DeepDream

DCGAN

DCGAN is a popular GAN design. It is composed of convolutional and transposed-convolutional layers, without max pooling or fully connected layers. The generator upsamples a latent vector into an image through a stack of transposed convolutions. This example is trained to generate MNIST digits, and we start the article with this most basic GAN model.


Dataset setup for MNIST samples:


Here is the generator, which is largely self-explanatory. We will not add many comments on DCGAN because the code is fairly standard deep learning.

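For reference, here is a minimal generator sketch along the lines of the official TensorFlow DCGAN tutorial (the layer sizes are illustrative, not the exact original code):

import tensorflow as tf
from tensorflow.keras import layers

def make_generator_model():
    # Project a 100-d noise vector to 7x7x256, then upsample to a 28x28x1 image.
    return tf.keras.Sequential([
        layers.Dense(7 * 7 * 256, use_bias=False, input_shape=(100,)),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Reshape((7, 7, 256)),
        layers.Conv2DTranspose(128, (5, 5), strides=(1, 1), padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Conv2DTranspose(64, (5, 5), strides=(2, 2), padding='same', use_bias=False),  # 14x14
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        # tanh keeps the output in [-1, 1], matching the normalized MNIST images.
        layers.Conv2DTranspose(1, (5, 5), strides=(2, 2), padding='same', use_bias=False,
                               activation='tanh'),  # 28x28x1
    ])

generator = make_generator_model()
sample = generator(tf.random.normal([1, 100]), training=False)  # shape (1, 28, 28, 1)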

Discriminator:

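A matching discriminator sketch (again modeled on the tutorial, details illustrative): a small CNN that outputs a single real/fake logit.

import tensorflow as tf
from tensorflow.keras import layers

def make_discriminator_model():
    return tf.keras.Sequential([
        layers.Conv2D(64, (5, 5), strides=(2, 2), padding='same', input_shape=[28, 28, 1]),
        layers.LeakyReLU(),
        layers.Dropout(0.3),
        layers.Conv2D(128, (5, 5), strides=(2, 2), padding='same'),
        layers.LeakyReLU(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(1),  # raw logit; the loss below uses from_logits=True
    ])

discriminator = make_discriminator_model()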

Here are the loss functions. For the discriminator, we expect real images to be labeled as one and generated images as zero. For the generator loss, we want the discriminator to classify the generated images as one.

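A sketch of these two losses using binary cross-entropy on the discriminator logits (the same scheme the tutorial uses):

import tensorflow as tf

cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_output, fake_output):
    # Real images should be classified as 1, generated images as 0.
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss

def generator_loss(fake_output):
    # The generator wants the discriminator to classify its images as 1.
    return cross_entropy(tf.ones_like(fake_output), fake_output)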

The optimizer and the checkpoint:


Here is the training step:

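A sketch of the training step, assuming the generator, discriminator, and loss functions from the sketches above; each network gets its own optimizer and gradient tape:

import tensorflow as tf

generator_optimizer = tf.keras.optimizers.Adam(1e-4)
discriminator_optimizer = tf.keras.optimizers.Adam(1e-4)
BATCH_SIZE = 256
NOISE_DIM = 100

@tf.function
def train_step(images):
    noise = tf.random.normal([BATCH_SIZE, NOISE_DIM])
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        generated_images = generator(noise, training=True)
        real_output = discriminator(images, training=True)
        fake_output = discriminator(generated_images, training=True)
        gen_loss = generator_loss(fake_output)
        disc_loss = discriminator_loss(real_output, fake_output)
    # Each loss only updates its own network's weights.
    gen_grads = gen_tape.gradient(gen_loss, generator.trainable_variables)
    disc_grads = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
    generator_optimizer.apply_gradients(zip(gen_grads, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(disc_grads, discriminator.trainable_variables))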

The training loop:


Next, we start the model training. Earlier, we randomly sampled sixteen z values. At the end of each training epoch, we generate 16 images from these 16 z values. We save these images, and once training is done, we stitch them together into an animated GIF. So for each sampled z, we can see how the generated image changes as training progresses.


Here is the animation showing the training progress.

CycleGAN

CycleGAN applies a deep network G to transform one type of image into another, say from photographs to Van Gogh-like paintings. To train this model, we also train a discriminator to distinguish real Van Gogh paintings from the generated ones. We also train another deep network F to restore the original image. The whole model is trained to reduce the generator and discriminator losses as well as the reconstruction loss. As a result, G transforms one type of image into another so well that even the discriminator D has a hard time telling it apart from the real one.


Here are some applications that convert one type of picture to another.


Here are the general setup and sample preparation. In this example, we transform horse pictures into zebras.


In this article, we will not detail the boilerplate code that is common in TensorFlow.

Image preparation and data augmentation:


Prepare the datasets:


Here are the generator and discriminator models from tensorflow_examples.models.


Here are the loss functions. They consist of the typical generator and discriminator losses, plus two more loss terms. The cycle loss is a kind of reconstruction loss. “real x” below is a real horse image and “real y” is a real zebra image. If a real horse passes through G and F, we should get the horse image back. If we pass a real zebra through F and G, we should get the zebra back. What is the identity loss? G transforms a horse into a zebra. But if we feed a zebra into G, we should still handle it properly and generate a zebra. The identity loss ensures that.

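A sketch of these two extra terms; LAMBDA weights them relative to the adversarial losses (the tutorial uses 10):

import tensorflow as tf

LAMBDA = 10

def calc_cycle_loss(real_image, cycled_image):
    # Reconstruction error after a round trip through both generators (G then F, or F then G).
    return LAMBDA * tf.reduce_mean(tf.abs(real_image - cycled_image))

def identity_loss(real_image, same_image):
    # Feeding a real zebra to the horse-to-zebra generator should return (almost) the same zebra.
    return LAMBDA * 0.5 * tf.reduce_mean(tf.abs(real_image - same_image))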

Optimizer and checkpoint code:


We will train the model for 40 epochs. Here is the code for generating a zebra image from a horse picture and plotting them out.


The first part of the training step is to generate images for different scenarios and compute the associated loss.


The second part applies the gradient descent.


Training loop and image generation:


Here are possible results, but be prepared to train the model for much longer.


Pix2Pix

Pix2Pix generates an image given a conditioning image x. This conditioning image guides what image is generated.


Like other GANs, it trains a discriminator to differentiate the real from the generated one with additional input from this conditional image.


Here are some other potential applications.


Datafile setup:


Here are the functions for loading, manipulating, and augmenting images.


Next, we prepare the datasets.


Here are the downsampling and upsampling layers using convolution and transposed convolution.

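A sketch of these two building blocks, roughly as in the pix2pix tutorial: a stride-2 convolution halves the spatial resolution, and a stride-2 transposed convolution doubles it.

import tensorflow as tf

def downsample(filters, size, apply_batchnorm=True):
    initializer = tf.random_normal_initializer(0., 0.02)
    block = tf.keras.Sequential()
    block.add(tf.keras.layers.Conv2D(filters, size, strides=2, padding='same',
                                     kernel_initializer=initializer, use_bias=False))
    if apply_batchnorm:
        block.add(tf.keras.layers.BatchNormalization())
    block.add(tf.keras.layers.LeakyReLU())
    return block

def upsample(filters, size, apply_dropout=False):
    initializer = tf.random_normal_initializer(0., 0.02)
    block = tf.keras.Sequential()
    block.add(tf.keras.layers.Conv2DTranspose(filters, size, strides=2, padding='same',
                                              kernel_initializer=initializer, use_bias=False))
    block.add(tf.keras.layers.BatchNormalization())
    if apply_dropout:
        block.add(tf.keras.layers.Dropout(0.5))
    block.add(tf.keras.layers.ReLU())
    return block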

This is the generator. Skip connections pass features from each downsampling layer to the upsampling layer at the same spatial resolution.


Here are some of the skip connections:


The generator loss:

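In the pix2pix setup, the generator loss is usually an adversarial term plus a weighted L1 distance to the target image; a sketch (LAMBDA = 100 as in the paper and tutorial):

import tensorflow as tf

loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100

def generator_loss(disc_generated_output, gen_output, target):
    # Adversarial term: the discriminator should label the generated image as real (1).
    gan_loss = loss_object(tf.ones_like(disc_generated_output), disc_generated_output)
    # L1 term: the generated image should stay close to the ground-truth target.
    l1_loss = tf.reduce_mean(tf.abs(target - gen_output))
    return gan_loss + LAMBDA * l1_loss, gan_loss, l1_loss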

Here are the discriminator and the loss function.


Optimizer, checkpoint, and a utility function for generating images from the model given conditional test images:


generate_images also plots out a sample, its ground truth, and the model prediction.


Training step:


Training and image generation:


Here are some generated images:


Neural style transfer

“Neural style transfer” transfers the style of one image onto another image while keeping the latter's content.


In this example, we use VGG19 to extract content features from a target content image and style features from a target-style image. Then, we use the content image as the starting source. We extract the source content and style features and compare them with the target content and the target style features. We compute the corresponding MSE (mean square error) and use the loss gradient to push the source images towards the target style while keeping the content features close to the original.

Setup:


Here is the utility code for loading and displaying images.


We can download a pre-trained TF Hub model for the style transfer.


Here is a real dog picture transformed with a painting style.


Next, we will redo the exercise. This time, we use a headless VGG19 model (without the classification head) to extract features and write the code to perform the style transfer ourselves. The model summary lists the layers we can tap for features.


We use features closer to the input layers to capture the style (color, stroke style, etc …) and features closer to the output for the content information.


The style of an image is characterized by a Gram matrix that measures the relationships between features. For example, how certain strokes may relate to certain colors.

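A sketch of the Gram matrix computation: for one style layer's feature maps, take the channel-by-channel correlations averaged over all spatial positions.

import tensorflow as tf

def gram_matrix(input_tensor):
    # input_tensor: (batch, height, width, channels) feature maps from a style layer.
    result = tf.linalg.einsum('bijc,bijd->bcd', input_tensor, input_tensor)
    input_shape = tf.shape(input_tensor)
    num_locations = tf.cast(input_shape[1] * input_shape[2], tf.float32)
    return result / num_locations  # (batch, channels, channels)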

Finally, we build a model to extract the content and style features of an image.


Next, we extract style from the style image and content from the content image. And we start with the content image as the source for the neural style transfer. We also have a clip function to make sure the generated pixels are within 0 and 1.


Our total loss is a weighted MSE for the style and the content features between the newly generated image and the target style and target content respectively.


With gradient descent, our training pushes the generated image toward lower style and content losses.


But the generated image is noisy. So we add a variation loss to reduce noise. First, we shift the image toward the top left by one pixel. Then, we compute the absolute difference. The difference signals the high-frequency areas. We sum up these high-pass signals so the optimization moves the trainable pixels toward a smoother image.

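A sketch of that custom variation loss: pixel differences between neighbors act as a high-pass filter, and summing their absolute values penalizes noisy images (tf.image.total_variation computes essentially the same quantity).

import tensorflow as tf

def high_pass_x_y(image):
    # Differences between each pixel and its right / bottom neighbor.
    x_var = image[:, :, 1:, :] - image[:, :, :-1, :]
    y_var = image[:, 1:, :, :] - image[:, :-1, :, :]
    return x_var, y_var

def total_variation_loss(image):
    x_deltas, y_deltas = high_pass_x_y(image)
    return tf.reduce_sum(tf.abs(x_deltas)) + tf.reduce_sum(tf.abs(y_deltas))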

Here is the training loop with the additional variation loss. However, we use the built-in tf.image.total_variation instead of the custom method above.


Here is an example of the style transferred images.


Adversarial FGSM

By wearing special, tailored eyeglasses, the people in the top row are all misidentified as the well-known people in the second row.


The question is: can we design a perturbation small enough (the middle column) to have no visual impact on the source image, yet cause the resulting image (the right column) to be misclassified?


This technique applies pixel-wise changes along the direction of the maximum loss gradient. As a result, the logit score for the ground-truth class drops significantly with little or no visible change. In this example, we use a pre-trained MobileNet V2 as the classifier.


Prepare a Labrador image:


With one single gradient step, we can perturb the image so it is already misclassified. In this example, we play with different values of ε (the higher the value, the more visual distortion to the image).

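A sketch of the FGSM perturbation: take the gradient of the loss with respect to the input image (not the weights) and keep only its sign.

import tensorflow as tf

pretrained_model = tf.keras.applications.MobileNetV2(include_top=True, weights='imagenet')
pretrained_model.trainable = False
loss_object = tf.keras.losses.CategoricalCrossentropy()

def create_adversarial_pattern(input_image, input_label):
    with tf.GradientTape() as tape:
        tape.watch(input_image)              # track the image, not the model weights
        prediction = pretrained_model(input_image)
        loss = loss_object(input_label, prediction)
    gradient = tape.gradient(loss, input_image)
    return tf.sign(gradient)                 # fixed-size step in the direction that increases the loss most

# perturbations = create_adversarial_pattern(image, label)       # image normalized to [-1, 1]
# adv_x = tf.clip_by_value(image + eps * perturbations, -1, 1)   # eps controls the distortion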

Autoencoder

An autoencoder uses an encoder to extract latent features z and a decoder to reconstruct the image.


Data setup:


Here are the autoencoder and the training.

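A minimal autoencoder sketch in the spirit of the tutorial: a dense encoder compresses a flattened 28x28 image to a small latent vector, a dense decoder reconstructs it, and the model is trained to reproduce its own input.

import tensorflow as tf
from tensorflow.keras import layers, losses

class Autoencoder(tf.keras.Model):
    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            layers.Flatten(),
            layers.Dense(latent_dim, activation='relu'),   # latent features z
        ])
        self.decoder = tf.keras.Sequential([
            layers.Dense(28 * 28, activation='sigmoid'),
            layers.Reshape((28, 28)),
        ])

    def call(self, x):
        return self.decoder(self.encoder(x))

autoencoder = Autoencoder()
autoencoder.compile(optimizer='adam', loss=losses.MeanSquaredError())
# autoencoder.fit(x_train, x_train, epochs=10, validation_data=(x_test, x_test))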

Original and reconstructed images:


Autoencoder — Denoising

We can use an autoencoder to denoise images.


First, we need to create noisy images from the training images. These images will be used as source images during training.

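A sketch of the noise injection, assuming images scaled to [0, 1] (the noise level and the denoiser model name are illustrative):

import tensorflow as tf

def add_noise(images, noise_factor=0.2):
    # Add Gaussian noise and clip the pixels back into the valid range.
    noisy = images + noise_factor * tf.random.normal(shape=tf.shape(images))
    return tf.clip_by_value(noisy, 0., 1.)

# x_train_noisy = add_noise(x_train)      # model input during training
# denoiser.fit(x_train_noisy, x_train)    # targets are the clean originals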

The idea is simple. Our source images are the noisy images and the target images are the original images. We train an autoencoder to convert noisy images to clean images. The autoencoder is trained to extract features without noise.


Autoencoder — Anomaly detection

ECG records the electrical activity generated by the heart.


The dataset contains a CSV file with normal and anomalous ECG samples. First, we read the samples and prepare them.


The AnomalyDetector is a simple autoencoder. We train it on normal ECGs only, expecting it to reconstruct normal ECGs nicely. We use MAE (mean absolute error) as the loss function.


But when we reconstruct anomalous ECGs, we expect a much higher MAE reconstruction loss. We do not expect the autoencoder to extract features that are present in anomalous ECGs but not in normal ECGs, so it has a harder time reconstructing them.


We compute the MAE reconstruction losses for all the normal ECGs and take their mean and standard deviation. We set the threshold to one standard deviation above the mean. In the example below, we compute the reconstruction loss for the anomalous ECGs; when a sample's error is higher than this threshold, we flag it as anomalous.

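A sketch of the thresholding logic, given a trained autoencoder and the prepared ECG tensors (the variable names in the usage comments are illustrative):

import numpy as np
import tensorflow as tf

def compute_threshold(model, normal_data):
    # MAE reconstruction error on the normal training ECGs.
    reconstructions = model(normal_data)
    train_loss = tf.keras.losses.mae(reconstructions, normal_data)
    # One standard deviation above the mean training reconstruction error.
    return np.mean(train_loss) + np.std(train_loss)

def predict_anomaly(model, data, threshold):
    reconstructions = model(data)
    loss = tf.keras.losses.mae(reconstructions, data)
    # Samples that reconstruct poorly are flagged as anomalous.
    return tf.math.greater(loss, threshold)

# threshold = compute_threshold(autoencoder, normal_train_data)
# flags = predict_anomaly(autoencoder, anomalous_test_data, threshold)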

In this example, it also prints out the accuracy, precision, and recall for the anomalous ECG samples.


Convolutional Variational Autoencoder (VAE)

This example trains a Variational Autoencoder (VAE) to generate MNIST digits. In contrast to a plain autoencoder, a VAE generates the parameters of a probability distribution over the latent features instead of the latent features themselves. Modeling a probability distribution, instead of a point estimate, may capture and reason about uncertainty in real-life problems better.


If we assume the latent feature z has a distribution of a multivariate Gaussian with a diagonal covariance matrix, the encoder predicts the mean and the variance for each feature in z. To reconstruct the image, we sample z from the predicted distribution. For example,

z ∼ Ɲ(μ(x), σ²(x))

And then, we decode z with the decoder.

Reparameterization trick

So how can we perform backpropagation when one of the operations is sampling? We can formulate the objective with expectation and differentiate it.

∇𝜙 E_q𝜙(z)[ f(z) ] = E_q𝜙(z)[ f(z) ∇𝜙 log q𝜙(z) ]

The calculation involves the gradient of the log probability (the last term above). Unfortunately, many distributions are not differentiable or the gradient is hard to compute. To address that, we apply a technique called the reparameterization trick.


It simply moves the sampling operation out of the backpropagation path of the trainable parameters. It changes the original sampling operation into x = μ + σ ε where ε is sampled from Ɲ(0, 1). Now, the equation for x is simple and easy to differentiate. And ε is sampled outside the backpropagation path of μ and σ. So we don’t need to compute the corresponding gradient.
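
A sketch of the trick in code:

import tensorflow as tf

def reparameterize(mean, logvar):
    # Sample eps ~ N(0, 1) outside the backpropagation path, then form
    # z = mean + sigma * eps, which is differentiable w.r.t. mean and logvar.
    eps = tf.random.normal(shape=tf.shape(mean))
    return mean + eps * tf.exp(0.5 * logvar)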

VAE Loss function ELBO (optional)

So what is the loss function for VAE? We can summarize it mathematically as:

log p(x) ≥ ELBO, and we train the VAE by maximizing

ELBO = E_q𝜙(z|x)[ log pθ(x|z) ] − KL( q𝜙(z|x) ‖ p(z) )

where

q𝜙(z|x) is the encoder (parameterized by 𝜙), pθ(x|z) is the decoder (parameterized by θ), and p(z) is the prior over the latent features z.

This sounds very hard, so let’s explain it one step at a time and derive the VAE loss function accordingly. KL Divergence measures the difference between two distributions and it is defined as:

KL( q(x) ‖ p(x) ) = E_q(x)[ log q(x) − log p(x) ]

Let’s replace (x) with (z|x):

KL( q(z|x) ‖ p(z|x) ) = E_q(z|x)[ log q(z|x) − log p(z|x) ]

The encoder q is parameterized by 𝜙 and predicts the distribution of z given x:

q𝜙(z|x)

After applying Bayes’ Theorem

p(z|x) = p(x|z) p(z) / p(x)

and some simple shuffling, the KL divergence becomes (proof):

KL( q𝜙(z|x) ‖ p(z|x) ) = log p(x) − E_q𝜙(z|x)[ log p(x|z) ] + KL( q𝜙(z|x) ‖ p(z) )

or

log p(x) = KL( q𝜙(z|x) ‖ p(z|x) ) + E_q𝜙(z|x)[ log p(x|z) ] − KL( q𝜙(z|x) ‖ p(z) )

Let’s name the last two terms ELBO. Because the KL-divergence is always greater than or equal to 0, the log evidence log p(x) is always greater than or equal to ELBO.

log p(x) = KL( q𝜙(z|x) ‖ p(z|x) ) + ELBO ≥ ELBO

That is why it is called the evidence lower bound (ELBO). Since log p(x) is constant w.r.t. 𝜙, the KL-divergence and ELBO add up to a constant: the KL-divergence drops when ELBO increases. Minimizing the KL-divergence is therefore the same as maximizing ELBO.


But how can we compute the ground truth p(z) and p(x|z)? We introduce the latent factor z and we have a great deal of freedom in making further assumptions or constraints. For example, we can assume z has a standard normal distribution Ɲ(0, 1). So the KL term can be computed analytically. For instance,

KL( Ɲ(μ, σ²) ‖ Ɲ(0, 1) ) = ½ Σᵢ ( μᵢ² + σᵢ² − log σᵢ² − 1 )

To estimate p(x|z), we model it with a decoder network parameterized by θ:

pθ(x|z)

So the VAE objective becomes:

maximize over θ and 𝜙:   ELBO(θ, 𝜙) = E_q𝜙(z|x)[ log pθ(x|z) ] − KL( q𝜙(z|x) ‖ p(z) )

The first term in ELBO is a generation loss measuring how well we reconstruct the image. The second term is a latent loss that penalizes the model whenever the encoder distribution q𝜙(z|x) deviates from the prior p(z).


The latent loss ensures that the z distribution from the encoder stays close to our prior belief p(z).


In particular, our decoder generates pixel values, not probabilities. But we do not build a new model to estimate p(x|z). Instead, we repurpose the decoder for this. In our example, the image is binary (black and white). We can feed the logit outputs of the decoder to a sigmoid function to generate probability values. Then, we treat the problem as binary classification, and the generation loss is simply the cross-entropy loss.

How can we use the decoder in general for this purpose? The objective of the generation loss is to maximize the log-likelihood of x given z. We want to optimize an encoder such that the extracted latent factors are adequate to rebuild x. But we can expand the scope a little more and rephrase this objective as minimizing the reconstruction loss. As an image passes through the encoder and the decoder, we want to optimize both so that the reconstruction loss is as small as possible. Therefore, in some model training, the generation loss is replaced by the MSE of the reconstructed images instead.

Coding

Dataset setup:


This is the first part of the VAE model that contains an encoder and a decoder using convolution and transpose convolution.


This is the rest of the VAE class. “sample” samples latent features from a standard normal distribution and decodes them. “encode” predicts the means and the log variances of the features z from input x. “reparameterize” samples values using the reparameterization trick.


We will use the ELBO discussed before as the objective function.


While we provided the theoretical equations for computing ELBO, here is what we do in practice using Monte Carlo estimation (sample z and compute the expected values). We sample z from the distribution predicted by the encoder and compute the ELBO value below.

  • p(x|z): the decoder's logit prediction is passed through a sigmoid function to obtain this probability.
  • p(z): p(z) is assumed to have a standard normal distribution. Given a sampled z, we find the corresponding pdf value for z under this distribution.
  • q(z|x): Given the distribution predicted by the encoder, we compute the corresponding pdf value for the sampled z.

log_normal_pdf computes the log probability of “sample” under a normal distribution with the given mean and log variance. Then, compute_loss computes the ELBO.

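A sketch of these two functions, assuming a CVAE model with the encode, reparameterize, and decode methods described above (this follows the structure of the tutorial's loss):

import numpy as np
import tensorflow as tf

def log_normal_pdf(sample, mean, logvar, raxis=1):
    # Log density of `sample` under a diagonal Gaussian N(mean, exp(logvar)).
    log2pi = tf.math.log(2. * np.pi)
    return tf.reduce_sum(
        -0.5 * ((sample - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi), axis=raxis)

def compute_loss(model, x):
    # Single-sample Monte Carlo estimate of -ELBO.
    mean, logvar = model.encode(x)
    z = model.reparameterize(mean, logvar)
    x_logit = model.decode(z)
    # log p(x|z): Bernoulli likelihood via sigmoid cross-entropy on the decoder logits.
    cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=x)
    logpx_z = -tf.reduce_sum(cross_ent, axis=[1, 2, 3])
    logpz = log_normal_pdf(z, 0., 0.)           # p(z): standard normal prior
    logqz_x = log_normal_pdf(z, mean, logvar)   # q(z|x): the encoder's distribution
    return -tf.reduce_mean(logpx_z + logpz - logqz_x)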

Here is the training step. Our latent feature z is configured to have a dimension of 2. We also generate 16 random testing samples on z to be used later.


generate_and_save_images takes in an image, encodes it, and decodes it. We save the reconstructed images such that we can accumulate them and replay them later as an animated GIF. We also pick 16 testing images for later use.


We train the model and for each epoch, we save 16 reconstructed images to visualize the training.


After the training is done, we stitch all the reconstructed images together as a GIF file to show the training progress.


The latent factor z in this example has 2 components. We change them gradually and plot out the corresponding generated images. This technique gives us a big picture of what these latent factors are controlling.


DeepDream

DeepDream is about exaggerating features in an image. The objective is to apply gradient ascent on the input image so that we boost the features that are already highly activated in a chosen layer. Higher activation triggers a higher boost. The feature extraction is similar to neural style transfer. If we choose a style layer (or layers) to boost, we exaggerate the style. If we choose a content layer, we boost the content structure (like an eye).


Image setup:


We use Inception V3 for feature extraction and pre-select two layers whose activations we want to boost. We compute the loss from those activations.

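A sketch of the feature extractor and the activation-based loss ('mixed3' and 'mixed5' are Inception V3 layer names; any intermediate layers can be substituted):

import tensorflow as tf

base_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
names = ['mixed3', 'mixed5']
outputs = [base_model.get_layer(name).output for name in names]
dream_model = tf.keras.Model(inputs=base_model.input, outputs=outputs)

def calc_loss(img, model):
    # The quantity we maximize: the mean activation of each chosen layer, summed.
    img_batch = tf.expand_dims(img, axis=0)
    layer_activations = model(img_batch)
    if not isinstance(layer_activations, list):
        layer_activations = [layer_activations]
    return tf.reduce_sum([tf.reduce_mean(act) for act in layer_activations])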

Our training applies gradient ascent to exaggerate the selected layers' activations.


Here is the training.


To improve quality and resolution, we gradually upsample the image, and at each resolution we apply DeepDream again before upsampling further.


But applying the transformation to a large upsampled image can be expensive. Alternatively, we can upsample an image, divide it into tiles, and apply the transformation to each tile separately. To avoid edge effects between tiles, we roll the image randomly before breaking it into tiles and then unroll it back afterward.

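A sketch of the random roll, assuming img is an (H, W, 3) image tensor:

import tensorflow as tf

def random_roll(img, maxroll):
    # Randomly shift the image so tile boundaries fall in different places on every step.
    shift = tf.random.uniform(shape=[2], minval=-maxroll, maxval=maxroll, dtype=tf.int32)
    img_rolled = tf.roll(img, shift=shift, axis=[0, 1])
    return shift, img_rolled

# shift, img_rolled = random_roll(img, maxroll=512)
# ... accumulate per-tile gradients on img_rolled ...
# gradients = tf.roll(gradients, shift=-shift, axis=[0, 1])   # undo the roll afterwards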

Here is the model to perform the tiled training.


And this is the training loop.


Credits and References

All the source code originates from, or is modified from, the TensorFlow tutorials.
