TensorFlow Dataset & Data Preparation

Jonathan Hui
13 min read · Jan 6, 2021

In this article, we discuss how to use the TensorFlow (TF) Dataset API to build efficient data pipelines for training and evaluation. If the training data is small, we can fit it into memory and preprocess it as a NumPy ndarray. However, many real-life datasets are too large for that. To scale up the solution, we build data pipelines (tf.data.Dataset) that use features like preprocessing, prefetching, and caching.

MNIST Data Preprocessing with NumPy

Nevertheless, toy datasets remain important in developing novel models and algorithms. So, let’s start our discussion with these small datasets, like MNIST.

Keras provides APIs for loading popular datasets into memory as NumPy ndarrays. When an ndarray is passed into a TF operation, it is converted to a TF Tensor automatically (and vice versa). If possible, TF keeps the same underlying memory representation for NumPy ndarrays and Tensors, so the conversion is cheap. However, if the Tensor is hosted on a GPU, the conversion requires a data copy to the CPU host, since a NumPy ndarray always keeps a copy in host memory. Be aware of the performance impact if this happens unnecessarily.

For small image datasets, we load them into memory, rescale them, and reshape the ndarray into the shape required by the first deep learning layer. For example, a convolution layer expects an input shape of (batch size, width, height, channels), while a dense layer expects (batch size, width × height × channels). The following code rescales MNIST pixel values to the range 0 to 1 and prepares the data for a convolution layer, extending the ndarray from (60,000, 28, 28) to (60,000, 28, 28, 1) with […, tf.newaxis]. Alternatively, we can use reshape to change the shape of an ndarray.
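
A minimal sketch of this preparation (the variable names are illustrative, not the article's exact code):

```python
import tensorflow as tf

# Load MNIST into memory as NumPy ndarrays.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Rescale pixel values from [0, 255] to [0, 1].
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channel dimension for the convolution layer:
# (60000, 28, 28) -> (60000, 28, 28, 1).
x_train = x_train[..., tf.newaxis].astype("float32")
x_test = x_test[..., tf.newaxis].astype("float32")
```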

Here is the CNN model for your reference.
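
A comparable CNN, sketched here for illustration (not necessarily the article's exact architecture):

```python
from tensorflow.keras import layers, models

# A small CNN that consumes (28, 28, 1) images and outputs 10 class logits.
model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10),
])
```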

In line 29 below, we prepare the input data to be consumed by a dense layer instead.

Then, we use model.compile to configure the model for training.

Next, model.fit trains the model below for 10 epochs using the training images and labels we prepared earlier. When the input data to model.fit is an ndarray, the data is trained in mini-batches. By default, the batch size (batch_size) is 32. In addition, with validation_split=0.1, we reserve the last 10% of the training samples for validation.
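
A sketch of the compile-and-fit step, assuming the model and arrays defined in the earlier sketches:

```python
import tensorflow as tf

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# batch_size defaults to 32; the last 10% of x_train/y_train is held out for validation.
history = model.fit(x_train, y_train, epochs=10, validation_split=0.1)
```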

We can also partition the training samples and pass the validation data explicitly.

Here is the remaining code for making evaluations and predictions.
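
For example, assuming the model and test arrays above:

```python
# Evaluate on the held-out test set.
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)

# Predict logits for the first 5 test images.
predictions = model.predict(x_test[:5])
```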

model.fit prepares the mini-batches automatically when the input data is an ndarray. After discussing small datasets, let’s see how we use Dataset to scale up data loading and processing.

tf.data.Dataset Basics

As a simple demonstration, the code below creates 4 data samples using a Python list.
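
A minimal sketch (the sample values are arbitrary):

```python
import tensorflow as tf

# A dataset with 4 samples built from a Python list.
dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8])

for sample in dataset:
    print(sample.numpy())   # 8, 3, 0, 8
```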

Dataset.from_tensor_slices slices its input along the first dimension.

During early development, we often fabricate data to test out logic. In the code below, we feed uniformly distributed data of different shapes into datasets.
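
A sketch of such fabricated data (the shapes and names ds/ds2 follow the description below but are otherwise illustrative):

```python
import tensorflow as tf

# ds: 4 scalar samples drawn from a uniform distribution.
ds = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4]))

# ds2: 4 samples, each a tuple of (100 features, 1 integer label).
ds2 = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([4, 100]),
     tf.random.uniform([4, 1], maxval=10, dtype=tf.int32)))
```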

In line 14, ds2 generates tuples — the first element contains 100 features and the second element contains a label. We can also combine multiple datasets into a new dataset. For example, the code below creates a new dataset that supplies tuples whose first and second elements come from ds2 and ds respectively.
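
Assuming ds and ds2 from the sketch above, this is done with Dataset.zip:

```python
# Each sample is a tuple: (sample from ds2, sample from ds).
ds3 = tf.data.Dataset.zip((ds2, ds))
```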

We can examine what a dataset generates by inspecting the first data sample with take(1).

We can also define a sparse tensor. The indices below list the elements with non-zero values, and values stores the corresponding values.
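
For instance, a sketch along these lines (the indices and values are illustrative):

```python
import tensorflow as tf

# A 3x4 sparse tensor: non-zero values 1 and 2 at positions [0, 0] and [1, 2].
sparse = tf.SparseTensor(indices=[[0, 0], [1, 2]],
                         values=[1, 2],
                         dense_shape=[3, 4])

# A dataset whose single element is this sparse tensor.
ds4 = tf.data.Dataset.from_tensors(sparse)
```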

Mini-Batch (dataset.batch)

The datasets prepared so far generate one sample at a time. Nevertheless, training is usually done in mini-batches. In the example below, dataset.batch(size) creates a dataset that generates mini-batches containing 4 samples each.
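
A small sketch of batching:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(8)

# Each element of the batched dataset is now a mini-batch of 4 samples.
batched = dataset.batch(4)
for batch in batched:
    print(batch.numpy())   # [0 1 2 3], then [4 5 6 7]
```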

Cache & Prefetch

To improve the performance of the dataset pipeline, it is general practice to chain the pipeline with caching and prefetching. Caching keeps data in memory or on local storage after the first pass, and prefetching overlaps data preparation with training by preloading data in background threads.

Here is the usual ordering of the chain including shuffle and batch. (We will discuss shuffle later.)
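
A typical chain looks roughly like this, assuming ds is an existing dataset (the buffer sizes are illustrative):

```python
import tensorflow as tf

# Usual ordering: cache -> shuffle -> batch -> prefetch.
# AUTOTUNE lets tf.data pick the prefetch buffer size dynamically.
AUTOTUNE = tf.data.experimental.AUTOTUNE

ds = (ds.cache()
        .shuffle(buffer_size=1000)
        .batch(32)
        .prefetch(buffer_size=AUTOTUNE))
```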

Dataset with NumPy ndarray

We can wrap NumPy ndarrays in a Dataset and take advantage of the Dataset APIs. In the code below, we prepare a dataset with samples coming from Fashion MNIST.

Specifically, we construct a dataset in lines 27 and 28. Instead of generating one sample at a time, dataset.batch(32) produces mini-batches.
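
A sketch of this construction (not necessarily the article's exact code):

```python
import tensorflow as tf

# Load Fashion MNIST as ndarrays and rescale the pixel values.
(images, labels), _ = tf.keras.datasets.fashion_mnist.load_data()
images = images / 255.0

# Wrap the ndarrays in a Dataset and produce shuffled mini-batches of 32.
fmnist_train_ds = tf.data.Dataset.from_tensor_slices((images, labels))
fmnist_train_ds = fmnist_train_ds.shuffle(5000).batch(32)
```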

fmnist_train_ds generates tuples whose first element contains 32 images and whose second element contains the 32 corresponding labels.

And we can pass the dataset, which contains both images and labels, directly to model.fit, model.evaluate, and model.predict. For each training iteration, model.fit consumes one mini-batch from the dataset.
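
For example, assuming a compiled Keras classifier model and the fmnist_train_ds above:

```python
# The dataset already yields (images, labels) mini-batches,
# so no batch_size argument is needed.
model.fit(fmnist_train_ds, epochs=2)

loss, accuracy = model.evaluate(fmnist_train_ds)
predictions = model.predict(fmnist_train_ds)
```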

Dataset with NumPy ndarray & Keras Preprocessing Layers

In the previous section, data preprocessing was performed before model fitting. Keras preprocessing layers offer the option to perform this processing inside a model. In the example below, the rescaling and the reshaping are defined as layers inside the model. If the model runs in graph mode, these layers can be optimized together with other operations and run on GPUs. Here is the code, but we will defer the discussion of Keras preprocessing layers to a separate article.
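
A sketch of the idea (a model along these lines, not the article's exact one):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Rescaling and reshaping are performed inside the model itself.
model = tf.keras.Sequential([
    layers.experimental.preprocessing.Rescaling(1.0 / 255, input_shape=(28, 28)),
    layers.Reshape((28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(10),
])
```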

TensorFlow Datasets

TensorFlow also provides a large catalog of datasets that can be downloaded and loaded as a Dataset.
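
For example, with the tensorflow_datasets package:

```python
import tensorflow_datasets as tfds

# Download (if needed) and load MNIST as a tf.data.Dataset of (image, label) pairs.
ds_train = tfds.load("mnist", split="train", as_supervised=True, shuffle_files=True)
```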

TFRecord Dataset

Some TF projects save data in the form of TFRecord. It is a binary storage format for TF that stores a sequence of binary records. This binary format is more compact than a text format. Because it is binary with native support in TF, some projects, particularly in NLP, save huge datasets as TFRecord files so that they can be read more efficiently during training.

A TFRecord file contains serialized tf.train.Example messages in the form of {“string”: value} pairs. Below is a visualization of an Example message in text form, which includes image metadata (width, height) and image data.

Let’s see how we can construct a dataset from TFRecord files. First, we create a dataset that is composed of a list of TFRecord files.
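
A minimal sketch (the file paths are placeholders for your own TFRecord files):

```python
import tensorflow as tf

# Each element of raw_dataset is one serialized tf.train.Example (a string Tensor).
filenames = ["data/train-0.tfrecord", "data/train-1.tfrecord"]
raw_dataset = tf.data.TFRecordDataset(filenames)
```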

Then, we can use tf.io.parse_example to convert the tf.train.Example messages into Tensor, SparseTensor, or RaggedTensor objects. The dataset below produces a string Tensor “eg” for each sample, and we use tf.io.parse_example to extract the image data and the label from it.
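
A sketch of this parsing step, assuming the raw_dataset above; the feature names and types ("image_raw", "label") are hypothetical and must match how the Example messages were actually written:

```python
def tf_parse(eg):
    # parse_example expects a batch, so add a leading dimension and index it back out.
    example = tf.io.parse_example(
        eg[tf.newaxis],
        {"image_raw": tf.io.FixedLenFeature([], tf.string),   # hypothetical feature name
         "label": tf.io.FixedLenFeature([], tf.int64)})       # hypothetical feature name
    return example["image_raw"][0], example["label"][0]

decoded = raw_dataset.map(tf_parse)
```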

The code below manually examines a sample from a dataset and displays the corresponding image and label.

As a demonstration, the diagram on the left is a visualization of the features contained in a TFRecord file. On the right is the data extracted by tf.io.parse_example.

Here is another example.

Text Line Dataset

Data samples may be spread over multiple directories containing many files. In this section, we will create a dataset from many text files, using TextLineDataset to create samples in which each sample contains a single line of text.

The code below interleaves data from different files — picking a line of text from each file in rotation. This makes the sampled data more random for training.
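
A sketch of the interleaving (the file paths are placeholders):

```python
import tensorflow as tf

file_paths = ["data/file1.txt", "data/file2.txt", "data/file3.txt"]
files_ds = tf.data.Dataset.from_tensor_slices(file_paths)

# Read lines from cycle_length files at a time, one line from each file in rotation.
lines_ds = files_ds.interleave(tf.data.TextLineDataset, cycle_length=3)
```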

We can also use “skip” to skip the header of a file and apply a filter to select which lines of text are used.
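
For instance, with the Titanic CSV file used by the TensorFlow tutorials:

```python
import tensorflow as tf

titanic_file = tf.keras.utils.get_file(
    "train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)

# Skip the CSV header, then keep only the lines whose first field ("survived") is not "0".
survivors = (titanic_lines
             .skip(1)
             .filter(lambda line: tf.not_equal(tf.strings.substr(line, 0, 1), "0")))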

CSV Dataset

Let’s create datasets from CSV data. Each data sample will contain a dictionary of features, say survived: 0, sex: male, age: 22.0, … for the Titanic CSV dataset. If the data is not huge, we can read it into memory with Pandas and then create a dataset with from_tensor_slices.
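
A sketch of the in-memory route:

```python
import pandas as pd
import tensorflow as tf

titanic_file = tf.keras.utils.get_file(
    "train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
df = pd.read_csv(titanic_file)

# Each sample is a dictionary mapping column names to scalar feature values.
titanic_slices = tf.data.Dataset.from_tensor_slices(dict(df))
```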

However, for large datasets, we want to read samples from disk on demand. In line 85, we use tf.data.experimental.make_csv_dataset to load samples from a file gradually.
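
A minimal sketch, assuming the titanic_file downloaded above (the batch size is illustrative):

```python
# Stream samples from disk in mini-batches of 4, using the "survived" column as the label.
titanic_batches = tf.data.experimental.make_csv_dataset(
    titanic_file, batch_size=4, label_name="survived")
```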

Dataset — One file for each sample

Next, we want to handle multiple files — in particular, one file per sample. In line 117 below, we create a dataset that contains filenames from different directories, where each directory holds a particular class of samples. Then, in line 127, we create another dataset on top of it. This new dataset contains the images and associated labels — we use process_path to map each file into raw image data and a label. Since images belonging to the same class are stored in the same directory, we simply use an image’s parent directory name as its class label.
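
A sketch of the idea, assuming an images/<class_name>/<file>.jpg directory layout (the glob pattern is illustrative):

```python
import os
import tensorflow as tf

# A dataset of filenames gathered from the class sub-directories.
list_ds = tf.data.Dataset.list_files("images/*/*")

def process_path(file_path):
    # The parent directory name serves as the class label.
    label = tf.strings.split(file_path, os.sep)[-2]
    return tf.io.read_file(file_path), label

labeled_ds = list_ds.map(process_path)
```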

The code above provides fine control in creating datasets from multiple files. But Keras preprocessing also provides a simpler high-level API for images. First, we download a zip file and uncompress the files into:

Then, Keras preprocessing can create datasets directly from these directories using preprocessing.image_dataset_from_directory. It treats the contents of each file as a single sample.
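
A sketch along these lines, assuming data_dir points to the uncompressed directory (image size and batch size are illustrative):

```python
from tensorflow.keras import preprocessing

train_ds = preprocessing.image_dataset_from_directory(
    data_dir, validation_split=0.2, subset="training",
    seed=123, image_size=(180, 180), batch_size=32)

val_ds = preprocessing.image_dataset_from_directory(
    data_dir, validation_split=0.2, subset="validation",
    seed=123, image_size=(180, 180), batch_size=32)
```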

Note: The training dataset and validation dataset above originate from the same source, data_dir. To avoid sample overlap, they should use the same seed value or shuffle=False.

There is another version called preprocessing.text_dataset_from_directory for text files.

Padding

In many RNN models, the input time sequence has a variable length. The built-in RNN layers in TF have no problem with variable-length input; the issue is that training works on Tensors, so in each training step, all the samples in the same batch must have the same sequence length (a fixed-length sequence). This can be achieved by creating a dataset with padded_batch, as below. It finds the longest sequence within a batch and pads all samples in that batch to this length with zeros. For example, in the first batch below, the longest sequence has 4 timesteps, so all samples are extended to 4 timesteps. In the second batch, the longest sequence has 8 timesteps, so all samples are padded to a length of 8.
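
A sketch that reproduces this behavior with fabricated variable-length sequences (not the article's exact data):

```python
import tensorflow as tf

# Sample i is the value i repeated i times, so the sequence lengths vary: 1, 2, 3, ...
dataset = tf.data.Dataset.range(1, 100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))

# Within each batch of 4, shorter sequences are padded with 0
# to the length of the longest sequence in that batch.
dataset = dataset.padded_batch(4, padded_shapes=(None,))

for batch in dataset.take(2):
    print(batch.numpy())   # first batch padded to length 4, second to length 8
```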

drop_remainder

Sometimes, the last mini-batch in a dataset does not contain a full batch. We can use drop_remainder to skip this last partial mini-batch.
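
For example:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# 10 samples with batch size 3 -> the final partial batch of 1 sample is dropped.
batches = dataset.batch(3, drop_remainder=True)
for batch in batches:
    print(batch.numpy())   # [0 1 2], [3 4 5], [6 7 8]
```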

Repeat

We can also repeat the data in a dataset to supply (repeat) more data.

The original dataset in line 152 contains only 3 samples. By repeating it 3 times in line 153, we have 9 samples. With a batch size of 2, the new dataset generates 5 mini-batches.
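
A sketch that reproduces these numbers:

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])   # 3 samples

# repeat(3) yields 9 samples; batch(2) then produces 5 mini-batches
# (the last one holds a single sample).
batches = ds.repeat(3).batch(2)
for batch in batches:
    print(batch.numpy())
```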

If the initial dataset is small, we want to call repeat before batch (or shuffle) so that only the very last mini-batch may be smaller than the batch size. Otherwise, you may get a smaller mini-batch at the end of every epoch.

Shuffle

If data in a dataset is ordered or highly correlated, we want to shuffle it before training. In the example below, we have a dataset containing an ordered sequence of numbers from 0 to 99, and we shuffle it with a buffer of size 3. First, the first 3 elements of the dataset (0, 1, 2) are put into the buffer. We then randomly select an element from the buffer as the next sample and replace it with the next element of ds. Let’s trace through this example. Say 0 is randomly selected first; the buffer becomes (3, 1, 2). Next, say 2 is selected; the buffer becomes (3, 1, 4). Then 1 is selected as the third sample and the buffer becomes (3, 5, 4). So far, the dataset has generated 0, 2, 1. As demonstrated, if the data is highly ordered, we need a much larger shuffle buffer, comparable to the data size, for proper shuffling.
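
The sketch below mirrors that setup; a buffer of 3 is used only to make the mechanics visible:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(100)          # ordered: 0, 1, 2, ..., 99

# A buffer of 3 gives only mild shuffling for highly ordered data;
# in practice the buffer should be comparable to the dataset size.
shuffled = ds.shuffle(buffer_size=3)
```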

Preprocessing data with Map

Next, we will create a dataset with more complex preprocessing. In the code below, we use “map” to process the data further. The list_ds dataset supplies a stream of filenames. With “map”, we create a new dataset that turns this stream of files into a stream of images and labels. The method parse_image passed to “map” reads the image data from the file, decodes it, and resizes it to 128 × 128. In the end, it returns the preprocessed image data and the label. Note that “resize” will distort the image if the input does not have a 1-to-1 aspect ratio; use resize_with_pad to pad the image with zeros and keep the original aspect ratio if desired.
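
A sketch of such a parse_image, assuming list_ds supplies JPEG filenames laid out by class directory as before:

```python
import os
import tensorflow as tf

def parse_image(filename):
    # The parent directory name is the class label.
    label = tf.strings.split(filename, os.sep)[-2]

    image = tf.io.read_file(filename)
    image = tf.io.decode_jpeg(image)
    image = tf.image.convert_image_dtype(image, tf.float32)
    # May distort the aspect ratio; tf.image.resize_with_pad(image, 128, 128)
    # pads with zeros instead.
    image = tf.image.resize(image, [128, 128])
    return image, label

images_ds = list_ds.map(parse_image)
```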

In the code below, we map the input through an arbitrary rotation in scipy.ndimage to augment the data, creating a new dataset of arbitrarily rotated images. However, ndimage is not a TF library. This is no problem when the code runs in eager execution, but in graph mode we need to wrap it with tf.py_function first. Unfortunately, tf.py_function has a catch: it does not work well in a distributed environment with multiple GPUs or TPUs.
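
A sketch of this wrapping, assuming images_ds yields (image, label) pairs as produced by parse_image above:

```python
import numpy as np
import scipy.ndimage as ndimage
import tensorflow as tf

def random_rotate_image(image):
    # Plain Python/NumPy code: rotate by a random angle between -30 and 30 degrees.
    return ndimage.rotate(image, np.random.uniform(-30, 30), reshape=False)

def tf_random_rotate_image(image, label):
    im_shape = image.shape
    # tf.py_function lets the non-TF rotation run inside the tf.data pipeline.
    [image] = tf.py_function(random_rotate_image, [image], [tf.float32])
    image.set_shape(im_shape)   # py_function loses static shape information
    return image, label

rot_ds = images_ds.map(tf_random_rotate_image)
```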

Time Sequence Data (Optional)

The labels for a time sequence model may originate from, or be transformed from, the input features. In this section, we demonstrate different ways to manipulate datasets to create labels related to the source data. First, let’s fabricate a dataset consisting of increasing numbers.
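
For instance (the length 100,000 is arbitrary):

```python
import tensorflow as tf

# A stream of increasing integers: 0, 1, 2, 3, ...
range_ds = tf.data.Dataset.range(100000)
```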

In the first example, we drop the last timestep to create the new input and shift the original input by one timestep to the left to create the labels.
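
A sketch of this shift-by-one labeling, assuming range_ds above (the window length of 10 is illustrative):

```python
# Each block of 10 consecutive numbers becomes an (input, label) pair:
# the inputs drop the last timestep, the labels drop the first.
batches = range_ds.batch(10, drop_remainder=True)

def dense_1_step(batch):
    return batch[:-1], batch[1:]

predict_dense_1_step = batches.map(dense_1_step)
```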

Here, we use the first 10 entries in range_ds as data and then the next 5 entries as labels.
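
One way to get this split, assuming range_ds above:

```python
# Split every block of 15 consecutive numbers: the first 10 are the features
# and the following 5 are the labels.
batches = range_ds.batch(15, drop_remainder=True)

def label_next_5_steps(batch):
    return batch[:-5], batch[-5:]

predict_5_steps = batches.map(label_next_5_steps)
```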

In this example, the input features are consecutive. Then for every 15 features, we use the last five entries as labels.
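
A sketch under that reading, assuming range_ds above: the feature windows tile the data consecutively ([0–9], [10–19], …), and the labels for each window are the first 5 numbers of the next window, so each sample spans 15 consecutive values with the last 5 as labels.

```python
feature_length, label_length = 10, 5

features = range_ds.batch(feature_length, drop_remainder=True)
labels = (range_ds.batch(feature_length)
                  .skip(1)                                   # offset by one window
                  .map(lambda labels: labels[:label_length])) # keep the first 5 values

predicted_steps = tf.data.Dataset.zip((features, labels))
```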

window

range_ds.window below creates a dataset of datasets, i.e., windows is a dataset containing many datasets (sub-datasets). In line 14, we pick the first sub-dataset from windows and sample data from it in line 15. Since we use a window size of 6, the sub-dataset “window” in line 15 is composed of 6 samples, so the first sub-dataset contains the samples 0, 1, 2, 3, 4, and 5.
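
A minimal sketch, assuming range_ds above:

```python
# window(6) produces a dataset of datasets; each sub-dataset holds 6 samples.
windows = range_ds.window(6)

for sub_ds in windows.take(1):                  # the first sub-dataset
    print([elem.numpy() for elem in sub_ds])    # [0, 1, 2, 3, 4, 5]
```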

Next, we want to demonstrate the use of windows to create sliding data. flat_map takes a dataset of datasets (windows) and flattens it into a single dataset. The new dataset outputs batches from the first sub-dataset first; once that is exhausted, it moves on to the next sub-dataset in windows.

In line 18 below, we collect 5 batches of samples from the flattened dataset. The first batch of samples comes from the first sub-dataset and the second batch from the second sub-dataset.

But in line 17 below, we configure each sub-dataset to have a batch size of 3 only. So each sub-dataset produces 2 batches, and the diagram on the right shows the first 5 batches sampled from the flattened dataset. Set drop_remainder=True in line 17 if we don’t want the last batch to be smaller than 3.
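
A sketch of the batch-size-3 configuration, assuming range_ds above:

```python
# Flatten the dataset of datasets: each 6-sample sub-dataset is batched with
# size 3, so every window yields two batches.
windows = range_ds.window(6)
flat = windows.flat_map(lambda window: window.batch(3))

for batch in flat.take(5):
    print(batch.numpy())   # [0 1 2], [3 4 5], [6 7 8], [9 10 11], [12 13 14]
```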

As shown below, dataset.window also allows us to control the shift and the stride when sampling the data. “Shift” is the number of input elements by which the window moves forward each time a new sub-dataset is created. “Stride” is the spacing between the input elements picked within a sliding window (sub-dataset).
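
A sketch that makes the two parameters visible, assuming range_ds above (the values 6, 2, and 3 are illustrative):

```python
def make_window_dataset(ds, window_size=6, shift=2, stride=3):
    # shift: how far the window start moves between consecutive sub-datasets.
    # stride: spacing between the elements taken inside each window.
    windows = ds.window(window_size, shift=shift, stride=stride)
    return windows.flat_map(lambda w: w.batch(window_size, drop_remainder=True))

ds = make_window_dataset(range_ds)
for example in ds.take(3):
    print(example.numpy())
# [ 0  3  6  9 12 15], [ 2  5  8 11 14 17], [ 4  7 10 13 16 19]
```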

Credits and References

TensorFlow tutorial

TensorFlow guide
