TensorFlow Keras Preprocessing Layers & Dataset Performance

Jonathan Hui
Jan 10, 2021


While Keras provides deep learning layers to create models, it also provides APIs to preprocess data. For example, preprocessing.Normalization() normalizes features with the feature means and variances of the training dataset. These statistics are not trainable parameters. Instead, in line 18, we “adapt” the normalization layer to the training samples (“data”). This method calculates the means and variances automatically.
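A minimal sketch of adapting a normalization layer, assuming a recent TensorFlow release that exposes it as tf.keras.layers.Normalization (older releases place it under tf.keras.layers.experimental.preprocessing). The toy array is invented for illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy training data: 4 samples, 3 features on different scales.
data = np.array(
    [[0.1, 10.0, 100.0],
     [0.2, 20.0, 200.0],
     [0.3, 30.0, 300.0],
     [0.4, 40.0, 400.0]],
    dtype="float32",
)

# adapt() computes the per-feature mean and variance; they are not trained.
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(data)

normalized = np.asarray(normalizer(data))
print(normalized.mean(axis=0))  # close to 0 for every feature
print(normalized.std(axis=0))   # close to 1 for every feature
```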

In our second example, we adapt the TextVectorization layer to the corpus “data”. It creates a vocabulary and maps each word to an integer index: the ith word in the vocabulary maps to the integer i. These mappings transform a text into a sequence of integer word indices. “data” contains 8 samples, and it will be converted to 8 integer vectors when it is processed by the TextVectorization layer.
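As a rough sketch of the same idea, assuming tf.keras.layers.TextVectorization from a recent TensorFlow release and a hypothetical two-sentence corpus (not the 8-sample corpus described above):

```python
import tensorflow as tf

# Hypothetical corpus used only for illustration.
corpus = tf.constant([
    "The Brain is wider than the Sky",
    "The Brain is deeper than the sea",
])

vectorizer = tf.keras.layers.TextVectorization(output_mode="int")
vectorizer.adapt(corpus)  # builds the vocabulary from the corpus

vocab = vectorizer.get_vocabulary()
token_ids = vectorizer(corpus)  # one integer vector per input sentence

print(vocab[:3])        # reserved padding and OOV entries, then the most frequent word
print(token_ids.shape)  # (number of sentences, longest sentence length)
```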

But sometimes, we bypass the “adapt” call. Instead, we pass the vocabulary directly when the preprocessing layer is instantiated.

For example, during training, we adapt a layer to the training samples. Then, we extract the needed information (say, the vocabulary) from the layer. During production, we reload this information to recreate the same preprocessing layer without the original training dataset.

Preprocessing Layers Provided

The following are the different built-in preprocessing layers. They include image processing, image data augmentation, and structured data preprocessing.

Rescaling and resizing are common operations in preprocessing image data. In the code below, we apply these preprocessing steps as Keras layers inside a model.

Model Layers vs. Preprocessing Dataset

Keras preprocessing provides two different options for applying the data transformation.

preprocessing_layer is a Keras layer like preprocessing.Normalization

In option 1, the preprocessing layer is part of the model. It is part of the model computational graph that can be optimized and executed on a device like a GPU. This is the best option for the Normalization layer and all image preprocessing and data augmentation layers if GPU(s) are available.

Option 2 uses dataset.map to convert data in the dataset. Data augmentation happens asynchronously on the CPU and is non-blocking. Its key benefit is taking advantage of multithreading on the CPU. With dataset.prefetch, the preprocessing of different batches can overlap, both with each other and with the model training on the GPU. Nevertheless, the preprocessing logic will not be exported in the SavedModel. It has to be replicated on the server during production deployment. Option 2 is good for TextVectorization, structured data preprocessing, and image preprocessing layers when GPU(s) are not available.

Note: dataset.map runs in graph mode. It is just not part of a model. That is why a breakpoint placed inside a mapping function is never hit!
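The two options can be sketched side by side. This is an illustrative outline only: the toy data, the single Dense layer, and the variable names other than preprocessing_layer are assumptions, and a recent TensorFlow release is assumed for tf.keras.layers.Normalization and tf.data.AUTOTUNE.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 8 samples, 3 features, binary labels.
features = np.random.rand(8, 3).astype("float32")
labels = np.random.randint(0, 2, size=(8, 1)).astype("float32")

preprocessing_layer = tf.keras.layers.Normalization(axis=-1)
preprocessing_layer.adapt(features)

# Option 1: the preprocessing layer is part of the model graph.
inputs = tf.keras.Input(shape=(3,))
x = preprocessing_layer(inputs)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model_with_preprocessing = tf.keras.Model(inputs, outputs)

# Option 2: preprocessing runs in the tf.data pipeline on the CPU,
# and the model only ever sees already-processed tensors.
dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .batch(4)
    .map(lambda x, y: (preprocessing_layer(x), y),
         num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)
plain_inputs = tf.keras.Input(shape=(3,))
plain_outputs = tf.keras.layers.Dense(1, activation="sigmoid")(plain_inputs)
model_without_preprocessing = tf.keras.Model(plain_inputs, plain_outputs)

first_batch, _ = next(iter(dataset))
option1_predictions = model_with_preprocessing(features)
```

With option 1 the raw features go straight into the model; with option 2 the same normalization has to be reproduced wherever the model is served.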

Export the model

For cases where we choose option 2 for training, we can still deploy the solution in production using option 1. Actually, it can be done relatively easily by creating a new model including the preprocessing layer.

After training, we create this new model and save it as a SavedModel. The saved model contains the computational graph that includes the preprocessing layer and its parameters, like the means and variances in the Normalization layer. By packaging everything as a single unit, we save the effort of reimplementing the preprocessing logic on the production server. The new model can take raw text directly without preprocessing. This avoids missing or incorrect configuration for the preprocessing_layer during production.

If we stay with option 2 in production, we need to initialize the TextVectorization with the same vocabulary used by training. After a TextVectorization layer is adapted, we can retrieve the vocabulary with the layer.get_vocabulary method and save it.

To deploy a model in production, we initialize a new TextVectorization layer with the same vocabulary. For example, in line 66 below, we should use the path of the saved vocabulary instead.

Here is another way to configure the TextVectorization using set_vocabulary where vocab is an array of words.
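A small sketch of both ways of reusing a vocabulary, under the assumption of a recent TensorFlow release in which get_vocabulary accepts include_special_tokens. The corpus and variable names are hypothetical, and saving the vocabulary to disk is left out.

```python
import numpy as np
import tensorflow as tf

corpus = tf.constant([
    "the brain is wider than the sky",
    "the brain is deeper than the sea",
])

# Training time: adapt the layer, then extract the learned vocabulary.
train_vectorizer = tf.keras.layers.TextVectorization(output_mode="int")
train_vectorizer.adapt(corpus)
vocab = train_vectorizer.get_vocabulary(include_special_tokens=False)

# Production, variant A: pass the vocabulary when the layer is instantiated.
serving_vectorizer_a = tf.keras.layers.TextVectorization(
    output_mode="int", vocabulary=vocab)

# Production, variant B: create the layer first, then call set_vocabulary.
serving_vectorizer_b = tf.keras.layers.TextVectorization(output_mode="int")
serving_vectorizer_b.set_vocabulary(vocab)

# "lake" never appears in the corpus, so it maps to the OOV index.
sample = tf.constant(["the sea is deeper than the lake"])
ids_train = np.asarray(train_vectorizer(sample)).tolist()
ids_a = np.asarray(serving_vectorizer_a(sample)).tolist()
ids_b = np.asarray(serving_vectorizer_b(sample)).tolist()
```

All three layers should produce the same indices, which is the property that matters when the training dataset is no longer available.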

Let’s go through some examples of how to use them.

Image data augmentation

Keras' built-in preprocessing layer can be used to augment the image data. In the code below, we randomly flip, rotate, and zoom the input image. In this example, we also apply a rescaling layer to rescale the input values. Because these layers can be part of the computational graph, the preprocessing code can run on GPUs.

If the data augmentation layers are integrated into the model (option 1), the augmentation is applied only during training in model.fit. The images are left unchanged in model.evaluate and model.predict.
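A hedged sketch of such an augmentation stack, assuming the layers are available directly under tf.keras.layers (true for recent releases). The factors 0.1 and the random input batch are arbitrary illustrative choices.

```python
import numpy as np
import tensorflow as tf

augmentation = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),       # map [0, 255] to [0, 1]
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

# Hypothetical batch of two 32x32 RGB images with values in [0, 255).
images = tf.random.uniform((2, 32, 32, 3), maxval=255.0)

train_out = augmentation(images, training=True)    # random transforms applied
infer_out = augmentation(images, training=False)   # only the rescaling applied

unchanged_at_inference = np.allclose(
    np.asarray(infer_out), np.asarray(images) / 255.0, atol=1e-6)
```

The training flag is what model.fit, model.evaluate and model.predict set internally, so the random layers act only during training.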

For option 2, we will have a dataset without the data augmentation and another one on top of it with the augmentation mapping. Use the second (augmented) dataset for training and the first for inference.

Custom Data Augmentation

We can write our own data augmentation layer. The code below randomly inverts an image and is implemented as a layers.Lambda layer.

Or we can implement it by subclassing keras.layers.Layer.

Both implementations can be used for both options 1 and 2.
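Both variants might look roughly like the following. The helper name random_invert_img, the factor argument, and the assumption that pixel values lie in [0, 255] are illustrative choices, and tf.cond is used here so the same function works eagerly and in graph mode.

```python
import tensorflow as tf

def random_invert_img(x, p=0.5):
    # With probability p, invert pixel values assumed to lie in [0, 255].
    return tf.cond(tf.random.uniform([]) < p, lambda: 255.0 - x, lambda: x)

# Variant 1: wrap the function in a Lambda layer.
random_invert_lambda = tf.keras.layers.Lambda(
    lambda x: random_invert_img(x, p=0.5))

# Variant 2: subclass keras.layers.Layer.
class RandomInvert(tf.keras.layers.Layer):
    def __init__(self, factor=0.5, **kwargs):
        super().__init__(**kwargs)
        self.factor = factor

    def call(self, x):
        return random_invert_img(x, self.factor)

image = tf.fill((1, 4, 4, 3), 200.0)
always_inverted = RandomInvert(factor=1.0)(image)  # every pixel becomes 55
never_inverted = RandomInvert(factor=0.0)(image)   # pixels stay at 200
maybe_inverted = random_invert_lambda(image)       # inverted about half the time
```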


We can also use APIs in tf.image to augment the images. For example, we can flip, rotate, or change the saturation of the images.

In the code below, we prepare 3 datasets — training, validation, and testing. For the training dataset, we apply resize, rescale, and augmentation to the dataset with dataset.map. For validation and testing datasets, we just apply resize and rescale to the images.

Normalizing features

In the code below, we load the cifar10 dataset and flatten the tensor. The shape of x_train in line 57 becomes (50000, 3072). Then we create a Normalization layer and adapt it to x_train to compute the mean and the variance. Finally, from line 64 to 67, we create a model with a normalization layer to normalize the input.

Here is another example in which we use Normalization to normalize features in a CSV file.

Encoding categorical features via one-hot encoding

In this example, we convert categorical features into a one-hot encoding. First, we adapt StringLookup to a corpus such that a string (a word) can be represented as an index to a vocabulary, such as 4 for “a”. Then we use CategoryEncoding to convert the index to a one-hot vector. So “a” becomes [0, 0, 0, 0, 1]. “a”, “b” and “c” can each represent a category. Therefore, the constructed layer converts categorical features into one-hot vectors.
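A minimal sketch of this two-step encoding, assuming tf.keras.layers.StringLookup and tf.keras.layers.CategoryEncoding from a recent TensorFlow release. The concrete index assigned to each string and the width of the one-hot vector depend on the release and on which reserved tokens the lookup layer keeps, so they may differ from the numbers quoted above.

```python
import numpy as np
import tensorflow as tf

# Hypothetical corpus of category strings.
data = tf.constant([["a"], ["b"], ["c"], ["b"], ["c"], ["a"]])

lookup = tf.keras.layers.StringLookup()
lookup.adapt(data)  # builds the string -> index vocabulary

encoder = tf.keras.layers.CategoryEncoding(
    num_tokens=lookup.vocabulary_size(), output_mode="one_hot")

test_data = tf.constant([["a"], ["b"], ["c"]])
indices = tf.reshape(lookup(test_data), [-1])  # one integer index per string
one_hot = np.asarray(encoder(indices))         # one row per input string

print(lookup.get_vocabulary())
print(one_hot)
```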

Encoding integer categorical features via one-hot encoding

Our next example is similar to the previous one: it converts an input into a one-hot vector, but the corpus contains numbers instead of words.

Applying hashing to an integer categorical feature (optional)

For an input categorical feature that can take many different values (on the order of 10,000 or higher), where each value may appear only a few times in the training data, it becomes ineffective to encode the feature directly as a one-hot vector. Hashing allows us to first map these values to one of a fixed number of hash bins. Then, we apply CategoryEncoding to the hashed value to create a one-hot vector. In the example below, the Hashing layer has 64 bins. Therefore, after CategoryEncoding, each sample will be represented by a 64-D one-hot vector.
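The hashing step can be sketched as follows, assuming tf.keras.layers.Hashing and CategoryEncoding from a recent TensorFlow release. The random input data and the salt value are hypothetical.

```python
import numpy as np
import tensorflow as tf

# Hypothetical integer feature that takes values from a large range.
data = np.random.randint(0, 100000, size=(1000, 1))

hasher = tf.keras.layers.Hashing(num_bins=64, salt=37)
encoder = tf.keras.layers.CategoryEncoding(num_tokens=64, output_mode="one_hot")

bins = tf.reshape(hasher(data), [-1])  # a bin index in [0, 64) per sample
encoded = np.asarray(encoder(bins))    # one 64-D one-hot vector per sample

print(encoded.shape)
```

Different input values can collide in the same bin; that is the price paid for a fixed, small output width.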


Given a vocabulary for a corpus “data” below, we can represent a word using an index to a vocabulary instead of the string itself.

For example, with this corpus, we can generate an index of each word and use the index to represent a word.

In this process, two special tokens are created. Index 0 is the padding token, used to convert variable-length sequence input to fixed-length input. Index 1 is [UNK], which stands for out-of-vocabulary (OOV) words. For instance, during inference, we may encounter words that never occur in the corpus, and therefore we mark them as [UNK]. We may also limit the maximum vocabulary size (using max_tokens in TextVectorization), in which case rare words are dropped from the vocabulary and indexed as [UNK].

In the example below, we use TextVectorization to convert each word in a sentence into an integer index to a vocabulary. Given the sentence “The Brain is deeper than the sea”, TextVectorization converts it to an array containing 7 integer indexes.

Then it is fed into an Embedding layer that outputs a 64-D vector for each word. Therefore, our sentence will have an output shape of (7, 64). Then it is connected to an LSTM that outputs a single 5-component vector for this sentence. Here is the code.
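A rough sketch of this pipeline with the layers called eagerly rather than wired into a model. The two-sentence corpus is hypothetical, the weights are untrained, and a recent TensorFlow release is assumed for vocabulary_size.

```python
import tensorflow as tf

# Hypothetical corpus used to build the vocabulary.
corpus = tf.constant([
    "The Brain is wider than the Sky",
    "The Brain is deeper than the sea",
])

vectorizer = tf.keras.layers.TextVectorization(output_mode="int")
vectorizer.adapt(corpus)

embedding = tf.keras.layers.Embedding(
    input_dim=vectorizer.vocabulary_size(), output_dim=64)
lstm = tf.keras.layers.LSTM(5)

sentence = tf.constant(["The Brain is deeper than the sea"])
token_ids = vectorizer(sentence)  # shape (1, 7): one index per word
embedded = embedding(token_ids)   # shape (1, 7, 64): one 64-D vector per word
encoded = lstm(embedded)          # shape (1, 5): one vector for the sentence

print(token_ids.shape, embedded.shape, encoded.shape)
```

The leading dimension of 1 is the batch dimension, so the per-sentence shapes match the (7, 64) and 5-component shapes described above.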

Given “The Brain is deeper than the sea”, the whole sentence is encoded as a 5-component vector:


By default, the output_mode in TextVectorization is “int”: each word is represented by an integer index to a vocabulary. But there are other configurations. When it is “binary”, it encodes the input sentence as a bag of words, with the vector length equal to the vocabulary size (or max_tokens). Each entry marks the presence (1) or absence (0) of the corresponding word in the input.

Or the output_mode can be set as “count” in which each entry marks the term frequency (tf) of a word.
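For illustration, a small sketch of the “count” mode, assuming a recent TensorFlow release. The corpus and the test sentence are made up.

```python
import numpy as np
import tensorflow as tf

# Hypothetical corpus used to build the vocabulary.
corpus = tf.constant([
    "the brain is wider than the sky",
    "the brain is deeper than the sea",
])

vectorizer = tf.keras.layers.TextVectorization(output_mode="count")
vectorizer.adapt(corpus)
vocab = vectorizer.get_vocabulary()

# Each output entry is the number of times a vocabulary word occurs.
counts = np.asarray(vectorizer(tf.constant(["the sea the sky the sea"])))
the_count = int(counts[0, vocab.index("the")])
sea_count = int(counts[0, vocab.index("sea")])
brain_count = int(counts[0, vocab.index("brain")])

print(the_count, sea_count, brain_count)
```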


In previous examples, each vocabulary entry is a single word only (the first line below).

An n-gram is a sequence of n consecutive words; a bigram is a 2-gram, i.e. two consecutive words. When ngrams=2 in TextVectorization, our vocabulary will contain both single words and bigrams for the corpus. With output_mode=“binary” and ngrams=2, the final vocabulary (the second line above) will contain 41 entries for the corpus below.

Encoding text as a dense matrix of ngrams with multi-hot encoding

Here is an example of using bigrams to build up the vocabulary and use binary for the output (bag of words).

“The Brain is deeper than the sea” is transformed into a 41-D vector containing 0 or 1 below.

The model uses a dense model to convert this representation into a scalar value (0.7497798 for our sentence example).

Encoding text as a dense matrix of ngrams with TF-IDF weighting

When output_mode in TextVectorization is “binary”, the output represents a bag of words. When it is “count”, each entry counts the term frequency. When output_mode is “tf-idf”, each entry is computed as the term frequency-inverse document frequency below.

Here is the code to build up a scalar representation for a sentence using tf-idf encoding in TextVectorization and a dense layer.

cache & prefetch

To improve data throughput, we can apply cache and prefetch to the dataset. dataset.cache allows data to be cached in memory or on local storage.

With caching, we can eliminate the potential opening and reading of data after the first epoch.


We can also increase the sample throughput with prefetching. In prefetching (the second diagram below), a background thread starts fetching data for the next training step before the current training step is finished. As shown in the second diagram, the training and read operations overlap, which reduces unnecessary waiting for data.


With this performance improvement, it is a common practice to prepare a dataset with cache and prefetch.

Here is the ordering of the chain including shuffle and batch.
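One commonly used ordering is sketched below as an assumption, not necessarily the exact chain intended here: map, then cache, shuffle, batch, and finally prefetch. The expensive_preprocess function and the sizes are placeholders.

```python
import tensorflow as tf

def expensive_preprocess(x):
    # Stand-in for a costly deterministic transformation.
    return tf.cast(x, tf.float32) / 100.0

dataset = (
    tf.data.Dataset.range(100)
    .map(expensive_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                     # keep the mapped elements after the first epoch
    .shuffle(buffer_size=100)    # reshuffle the cached elements every epoch
    .batch(16)
    .prefetch(tf.data.AUTOTUNE)  # overlap input preparation with training
)

num_batches = sum(1 for _ in dataset)
print(num_batches)
```

Placing shuffle after cache keeps the order different in every epoch, and placing prefetch last lets the whole pipeline run ahead of the training step.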

interleave (parallelizing data extraction)

dataset.interleave parallelizes the data loading and interleaves the contents of the datasets (such as data file readers). cycle_length controls the number of input elements that will be processed concurrently. By setting it to 4, we allow 4 datasets to be used. The code below creates 4 datasets, each responsible for one file.

If we reduce cycle_length to 2, there will be only 2 datasets. Data from file3 and file4 will not be loaded until file1 and file2 are finished.

But the real parallelism comes from the num_parallel_calls setting. It controls the number of threads used to fetch inputs. If it is one, the 4 datasets simply take turns loading data. By setting num_parallel_calls to 4, we allow them to load data in parallel. num_parallel_calls can also be set to tf.data.AUTOTUNE. This allows TF to set it automatically according to the available CPU capacity.
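The effect of cycle_length can be sketched without real files. In this illustration the make_reader helper is a hypothetical stand-in for a per-file reader, and deterministic=True is set so that the parallel version yields the same order as a sequential one.

```python
import tensorflow as tf

def make_reader(file_id):
    # Stand-in for a per-file reader such as tf.data.TextLineDataset:
    # it yields the file id three times, as if the file had three records.
    return tf.data.Dataset.from_tensors(file_id).repeat(3)

file_ids = tf.data.Dataset.range(1, 5)  # pretend files 1..4

def interleaved(cycle_length):
    ds = file_ids.interleave(
        make_reader,
        cycle_length=cycle_length,
        block_length=1,
        num_parallel_calls=tf.data.AUTOTUNE,
        deterministic=True,
    )
    return [int(x) for x in ds]

all_four = interleaved(4)       # records from all four files alternate
two_at_a_time = interleaved(2)  # files 3 and 4 start after 1 and 2 finish

print(all_four)
print(two_at_a_time)
```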

Below is another example of using 2 datasets in generating samples.

With num_parallel_calls greater than one, multiple threads read concurrently for multiple datasets.

Modified from source

Parallelizing data transformation

We can also parallelize the data mapping in data preprocessing. In dataset.map, we set num_parallel_calls to AUTOTUNE to allow parallel reading and decoding of audio files.

Vectorizing mapping

Processing data as vectors is more efficient than processing scalars one by one. In the first example below, we preprocess each sample before batching. In the second example, we batch 256 samples together before the mapping, so the mapping is performed on a whole tensor at once, which is more efficient. Even though both produce the same samples, the latter is 10x faster. So be careful about the order of batch and map.
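A small sketch of the two orderings, with a trivial increment function standing in for the real mapping (an assumption for illustration). It checks only that the results match and makes no timing claim.

```python
import tensorflow as tf

def increment(x):
    # Works on a scalar and on a whole batch alike.
    return x + 1

base = tf.data.Dataset.range(1024)

# Map first: the function is invoked once per sample.
scalar_first = base.map(increment).batch(256)

# Batch first: the function is invoked once per batch of 256 samples.
batch_first = base.batch(256).map(increment)

same_output = all(
    bool(tf.reduce_all(a == b)) for a, b in zip(scalar_first, batch_first)
)
print(same_output)
```

The batch-first form only works when the mapping function can operate on a batched tensor, which is true for most elementwise tensor operations.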

Map & Cache

In general, we want to put cache after map to avoid repeating a time-consuming mapping, unless the data generated by “map” is too large to cache.

Sample Rebalance

Class imbalance hurts deep learning training. In the first 10 batches of the credit card dataset below, 99.57% of the samples belong to class 0.

To rebalance the dataset, we can apply filters to create two separate datasets that hold class 0 and class 1 samples separately. Then we create a balanced dataset by specifying the mix as [0.5, 0.5], i.e. the new dataset has an equal number of samples from both classes.
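This filter-and-mix approach can be sketched on synthetic data. The toy labels stand in for the credit card dataset, and tf.data.Dataset.sample_from_datasets is assumed to be available (older releases expose it as tf.data.experimental.sample_from_datasets).

```python
import numpy as np
import tensorflow as tf

# Hypothetical imbalanced data: 990 samples of class 0 and 10 of class 1.
labels = np.array([0] * 990 + [1] * 10, dtype="int64")
features = np.arange(1000, dtype="float32").reshape(-1, 1)
ds = tf.data.Dataset.from_tensor_slices((features, labels))

# One filtered dataset per class, repeated so neither side runs out.
negative_ds = ds.filter(lambda x, y: tf.equal(y, 0)).repeat()
positive_ds = ds.filter(lambda x, y: tf.equal(y, 1)).repeat()

# Draw from the two datasets with equal probability.
balanced_ds = tf.data.Dataset.sample_from_datasets(
    [negative_ds, positive_ds], weights=[0.5, 0.5], seed=0)

sampled_labels = [int(y) for _, y in balanced_ds.take(1000)]
positive_fraction = sum(sampled_labels) / len(sampled_labels)
print(positive_fraction)  # roughly 0.5 instead of the original 0.01
```

Because the minority class is repeated, the same few positive samples are seen many times, so this balances the batches without adding new information.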

Rejection resampling

However, the data is loaded twice, once for each filter. An alternative solution is to configure a rejection resampler that specifies the target class distribution. The class_func returns the class label part of the data coming from creditcard_ds. The resampler below (line 360) will use it to balance samples for the new dataset. Once the resampled dataset is created, we create a new one with the duplicated labels stripped (line 369).

Credits and References

All the source code originates from or is modified from the TensorFlow tutorial.


