TensorFlow Dataset & Data Preparation

In this article, we discuss how to use the TensorFlow (TF) Dataset API to build efficient data pipelines for training and evaluation. If the training data is small, we can fit it into memory and preprocess it as NumPy ndarrays. However, many real-life datasets are too large for that. To scale up the solution, we build data pipelines (tf.data.Dataset) with features like preprocessing, prefetching, and caching.

MNIST Data Preprocessing with NumPy

Keras provides APIs for loading popular datasets into memory as NumPy ndarrays. When an ndarray is passed into a TF operation, it is converted to a TF Tensor automatically (and vice versa). If possible, TF maintains the same underlying memory representation for NumPy ndarrays and Tensors, so the conversion is cheap. Nevertheless, if the Tensor is hosted on a GPU, the conversion requires a data copy to the CPU host, since a NumPy ndarray always has a copy in host memory. Be aware of the performance impact if this happens unnecessarily.

For small image datasets, we load them into memory, rescale them, and reshape the ndarray into the shape required by the first deep learning layer. For example, a convolution layer takes an input shape of (batch size, width, height, channels) while a dense layer takes (batch size, width × height × channels). The following code rescales the MNIST pixel values to between 0 and 1 and prepares the data for a convolution layer. So we extend the ndarray from (60,000, 28, 28) to (60,000, 28, 28, 1) using […, tf.newaxis]. Alternatively, we can use reshape to change the shape of an ndarray.
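Since the original code listing is not reproduced here, below is a minimal sketch of this preparation; the variable names are illustrative.

```python
import tensorflow as tf

# Load MNIST as NumPy ndarrays: (60000, 28, 28) uint8 images and (60000,) labels.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Rescale pixel values from [0, 255] to [0, 1].
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Add a channel dimension for a convolution layer: (60000, 28, 28) -> (60000, 28, 28, 1).
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]
```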

Here is the CNN model for your reference.
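The original model code is not shown here, so this is a minimal CNN sketch consistent with the (28, 28, 1) input prepared above; the exact layers in the article may differ.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10),   # 10 logits, one per MNIST digit class
])
```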

In line 29 below, we prepare the input data to be consumed by a dense layer instead.
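As a hedged sketch of that alternative, each 28×28 image is flattened into a 784-element vector before being fed to a dense layer:

```python
# (60000, 28, 28) or (60000, 28, 28, 1) -> (60000, 784) for a dense input layer.
x_train_dense = x_train.reshape(-1, 28 * 28)
x_test_dense = x_test.reshape(-1, 28 * 28)
```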

Then, we use model.compile to configure the model for training.
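A typical configuration for this kind of classifier looks like the following; the exact optimizer, loss, and metrics in the original code may differ.

```python
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```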

Next, model.fit trains the model below for 10 epochs using the training images and labels that we prepared before. When the input data to model.fit is an ndarray, the data is trained in mini-batches. By default, the batch size (batch_size) is 32. In addition, with validation_split=0.1, we reserve the last 10% of the training samples for validation.
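A sketch of that call:

```python
history = model.fit(
    x_train, y_train,
    epochs=10,
    batch_size=32,          # the default, shown here for clarity
    validation_split=0.1,   # hold out the last 10% of the samples for validation
)
```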

We can also partition the training samples and pass the validation data explicitly.
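For example, we could split the arrays ourselves and use validation_data (an illustrative sketch):

```python
# Reserve the last 10% of the training samples manually.
split = int(0.9 * len(x_train))
x_tr, y_tr = x_train[:split], y_train[:split]
x_val, y_val = x_train[split:], y_train[split:]

history = model.fit(x_tr, y_tr, epochs=10, validation_data=(x_val, y_val))
```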

Here is the remaining code for making evaluations and predictions.
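A hedged sketch of the evaluation and prediction calls:

```python
# Evaluate on the test set.
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)

# Predict logits for a few samples and convert them to class labels.
logits = model.predict(x_test[:5])
print(tf.argmax(logits, axis=-1).numpy())
```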

model.fit prepares the mini-batches automatically when the input data is an ndarray. After discussing small datasets, let’s see how we use Dataset to scale up data loading and processing.

tf.data.Dataset Basics

dataset.from_tensor_slices slices its content along the first dimension.
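For example, a minimal illustration:

```python
import tensorflow as tf

# Each element of the dataset is one row of the input tensor.
ds = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4], [5, 6]])
for element in ds:
    print(element.numpy())   # [1 2], [3 4], [5 6]
```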

During early development, we often fabricate data to test out the logic. In the code below, we feed uniform data of different shapes into a dataset.

In line 14, ds2 generates tuples — the first element contains 100 features and the second element contains a label. We can also use multiple datasets to create a new dataset. For example, the code below creates a new dataset that supplies tuples with the first and second elements coming from ds2 and ds respectively.
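Since the original listing is not included, here is a sketch of fabricating uniform data and zipping two datasets; the names ds, ds2, and ds3 follow the text, but the shapes are illustrative.

```python
# Fabricated data: 4 samples with 10 values each.
ds = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))

# ds2 generates tuples: 100 features plus one integer label per sample.
ds2 = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([4, 100]),
     tf.random.uniform([4], minval=0, maxval=10, dtype=tf.int32)))

# A new dataset whose first element comes from ds2 and second element from ds.
ds3 = tf.data.Dataset.zip((ds2, ds))

# Examine the first sample with take(1).
for (features_and_label, extra) in ds3.take(1):
    print(features_and_label[0].shape, features_and_label[1].numpy(), extra.shape)
```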

We can check what a dataset generates by examining the first data sample with take(1).

We can also define a sparse tensor. The indices below list the positions of the non-zero elements, and values stores the corresponding values.
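A minimal sparse tensor sketch; the actual indices and values in the article may differ.

```python
# A 3x4 sparse tensor with non-zero entries at (0, 1) and (2, 3).
sparse = tf.sparse.SparseTensor(
    indices=[[0, 1], [2, 3]],
    values=[10, 20],
    dense_shape=[3, 4])

print(tf.sparse.to_dense(sparse).numpy())

# A dataset can also slice a sparse tensor along its first dimension.
ds_sparse = tf.data.Dataset.from_tensor_slices(sparse)
```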

Mini-Batch (dataset.batch)

Cache & Prefetch

Here is the usual ordering of the chain including shuffle and batch. (We will discuss shuffle later.)
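Given any per-sample dataset ds, a commonly recommended ordering looks like the sketch below (cache before shuffle, batch before prefetch); the exact chain in the original code may differ.

```python
AUTOTUNE = tf.data.AUTOTUNE

pipeline = (ds
            .cache()                     # cache samples after the expensive steps
            .shuffle(buffer_size=1000)   # shuffle individual samples
            .batch(32)                   # group samples into mini-batches
            .prefetch(AUTOTUNE))         # overlap data preparation with training
```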

Dataset with NumPy ndarray

Specifically, we construct a dataset in lines 27 and 28. Instead of generating one sample at a time, dataset.batch(32) produces mini-batches.

fmnist_train_ds generates tuples in which the first element contains 32 images and the second element contains 32 labels.

And we can pass the dataset, which contains both images and labels, to model.fit, model.evaluate, and model.predict directly. For each training iteration, model.fit will consume a mini-batch from the dataset.
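A hedged sketch of building such a dataset from the Fashion-MNIST ndarrays and passing it to the Keras APIs; the simple dense model is illustrative.

```python
(fm_images, fm_labels), _ = tf.keras.datasets.fashion_mnist.load_data()
fm_images = fm_images.astype("float32") / 255.0

# Build the dataset from the ndarrays and group samples into mini-batches of 32.
fmnist_train_ds = tf.data.Dataset.from_tensor_slices((fm_images, fm_labels))
fmnist_train_ds = fmnist_train_ds.shuffle(5000).batch(32)

fm_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
fm_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# The dataset supplies both images and labels, so no separate y argument is needed.
fm_model.fit(fmnist_train_ds, epochs=2)
loss, accuracy = fm_model.evaluate(fmnist_train_ds)
```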

Dataset with NumPy ndarray & Keras Preprocessing Layers

TensorFlow Datasets

TFRecord Dataset

A TFRecord file contains serialized tf.train.Example messages in the form of {“string”: value} pairs. Below is a visualization of the Example messages in text form, which includes image metadata (width, height) and image data.

Let’s see how we can construct a dataset from TFRecord files. First, we create a dataset composed of a list of TFRecord files.

Then, we can use tf.io.parse_example to convert the tf.train.Example messages into Tensor, SparseTensor, or RaggedTensor objects. The dataset below yields a string Tensor “eg” for each sample, and we use tf.io.parse_example to extract the image data and the label from it.
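A hedged sketch of this flow. The file names and feature keys are assumptions based on common TFRecord layouts, and for simplicity the sketch uses tf.io.parse_single_example, the single-record variant of tf.io.parse_example.

```python
# A dataset that reads serialized tf.train.Example records from TFRecord files.
file_paths = ["train-00.tfrecord", "train-01.tfrecord"]   # hypothetical files
raw_ds = tf.data.TFRecordDataset(file_paths)

# Describe the features we expect inside each Example message.
feature_spec = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),
    "image/class/label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(eg):
    # eg is a scalar string tensor holding one serialized Example.
    parsed = tf.io.parse_single_example(eg, feature_spec)
    image = tf.io.decode_jpeg(parsed["image/encoded"], channels=3)
    return image, parsed["image/class/label"]

parsed_ds = raw_ds.map(parse)
```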

The code below manually examines a sample from a dataset and displays the corresponding image and label.

As a demonstration, the diagram on the left is a visualization of the features contained in a TFRecord file. On the right is the data extracted by tf.io.parse_example.

Here is another example.

Text Line Dataset

The code below interleaves data from different files, picking a line of text from each file in a round-robin manner. This makes the sampled data more random for training.
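A hedged sketch using interleave over multiple text files; the file names are placeholders.

```python
file_paths = ["file_a.txt", "file_b.txt", "file_c.txt"]   # hypothetical files
files_ds = tf.data.Dataset.from_tensor_slices(file_paths)

# cycle_length=3 keeps three files open and rotates one line from each in turn.
lines_ds = files_ds.interleave(
    tf.data.TextLineDataset,
    cycle_length=3)
```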

We can also use skip to skip the header of a file and apply a filter to select which lines of text to use.
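For example, dropping a header line and keeping only non-empty lines (an illustrative sketch):

```python
lines = tf.data.TextLineDataset("data.csv")   # hypothetical file with a header row

lines = (lines
         .skip(1)                                            # drop the header line
         .filter(lambda line: tf.strings.length(line) > 0))  # keep non-empty lines
```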

CSV Dataset

However, for large data samples, we want to read samples from disk on demand. In line 85, we use tf.data.experimental.make_csv_dataset to load samples from a file gradually.
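A hedged sketch of make_csv_dataset; the file name, label column, and batch size are illustrative.

```python
csv_ds = tf.data.experimental.make_csv_dataset(
    "train.csv",          # hypothetical CSV file with a header row
    batch_size=32,
    label_name="label",   # hypothetical column returned separately as the label
    num_epochs=1)

# Each element is (features_dict, labels), with one tensor per CSV column.
for features, labels in csv_ds.take(1):
    print(list(features.keys()), labels.shape)
```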

Dataset — One file for each sample

The code above provides fine control in creating datasets from multiple files. But Keras preprocessing also provides a simpler high-level API for images. First, we download a zip file and uncompress the files into:

Then, Keras preprocessing can create datasets directly from these directories using preprocessing.image_dataset_from_directory. It treats the contents of a file as a single sample.

Note: The training dataset and validation dataset above originate from the same source, data_dir. To avoid sample overlap, they should use the same seed value or shuffle=False.
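A hedged sketch of that split; data_dir, the image size, and the split fraction are illustrative.

```python
from tensorflow.keras import preprocessing

train_ds = preprocessing.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,               # same seed as the validation split to avoid overlap
    image_size=(180, 180),
    batch_size=32)

val_ds = preprocessing.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(180, 180),
    batch_size=32)
```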

There is another version called preprocessing.text_dataset_from_directory for text files.

Padding

drop_remainder

Sometimes, the last mini-batch in a dataset may not contain a full batch. We can use drop_remainder to skip the last mini-batch instead.
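For example, a minimal illustration:

```python
ds = tf.data.Dataset.range(10)

# 10 samples with a batch size of 3 leaves a final batch of size 1; drop it.
for b in ds.batch(3, drop_remainder=True):
    print(b.numpy())   # [0 1 2], [3 4 5], [6 7 8]
```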

Repeat

The original dataset in line 152 contains only 3 samples. By repeating it 3 times in line 153, we now have 9 samples. With a batch size of 2, the new dataset generates 5 mini-batches.

If the initial dataset is small, we want to call repeat before batch (or shuffle) so that only the very last mini-batch can be smaller than the batch size. Otherwise, we may get a smaller mini-batch at the end of every epoch.
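A sketch illustrating the difference with 3 samples repeated 3 times and a batch size of 2:

```python
ds = tf.data.Dataset.range(3)

# repeat before batch: only the final mini-batch may be partial.
for b in ds.repeat(3).batch(2):
    print(b.numpy())   # [0 1], [2 0], [1 2], [0 1], [2]

# batch before repeat: a partial mini-batch appears at the end of every epoch.
for b in ds.batch(2).repeat(3):
    print(b.numpy())   # [0 1], [2], [0 1], [2], [0 1], [2]
```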

Shuffle

Preprocessing data with Map

In the code below, we map the input through an arbitrary rotation from scipy.ndimage to augment the data. In this process, a new dataset with arbitrarily rotated images is created. However, ndimage is not a TF library. That is no problem when the code runs in eager execution, but in graph mode we need to wrap it with tf.py_function first. Unfortunately, tf.py_function comes with a catch: it will not work well in a distributed environment with multiple GPUs or TPUs.
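A hedged sketch of this pattern, wrapping scipy.ndimage.rotate with tf.py_function; it assumes the (28, 28, 1) MNIST arrays x_train and y_train prepared earlier, and the rotation range is illustrative.

```python
import numpy as np
import scipy.ndimage as ndimage

def random_rotate_image(image):
    # Plain NumPy/SciPy code: rotate by a random angle in [-30, 30] degrees.
    return ndimage.rotate(image, np.random.uniform(-30, 30), reshape=False)

def tf_random_rotate(image, label):
    # Wrap the non-TF function so it can run inside a graph-mode map().
    [image] = tf.py_function(random_rotate_image, [image], [tf.float32])
    image.set_shape((28, 28, 1))   # py_function loses static shape information
    return image, label

augmented_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                .map(tf_random_rotate)
                .batch(32))
```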

Time Sequence Data (Optional)

In the first example, we drop the last timestep to create a new input and time shift the original input by one timestep to the left to create the labels.
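A hedged sketch of this, assuming range_ds is a simple tf.data.Dataset.range sequence grouped into fixed-length sequences of 10 timesteps:

```python
range_ds = tf.data.Dataset.range(100)
seq_ds = range_ds.batch(10, drop_remainder=True)   # sequences of 10 timesteps

def shift_one_step(seq):
    # Input: all but the last timestep. Label: the sequence shifted left by one.
    return seq[:-1], seq[1:]

for features, labels in seq_ds.map(shift_one_step).take(1):
    print(features.numpy())   # [0 1 2 3 4 5 6 7 8]
    print(labels.numpy())     # [1 2 3 4 5 6 7 8 9]
```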

Here, we use the first 10 entries in range_ds as data and then the next 5 entries as labels.
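A sketch, assuming each group of 15 consecutive entries is split into 10 inputs followed by 5 labels:

```python
def split_10_5(seq):
    return seq[:10], seq[10:]

for features, labels in range_ds.batch(15, drop_remainder=True).map(split_10_5).take(1):
    print(features.numpy())   # [0 1 2 3 4 5 6 7 8 9]
    print(labels.numpy())     # [10 11 12 13 14]
```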

In this example, the input features are consecutive. Then for every 15 features, we use the last five entries as labels.

window

range_ds.window below creates a dataset of datasets, i.e. windows is a dataset containing many datasets (sub-datasets). In line 14, we pick the first sub-dataset from windows and sample data from it in line 15. Since we use a window size of 6, the sub-dataset “window” in line 15 is composed of 6 samples, and the first sub-dataset contains samples 0, 1, 2, 3, 4, and 5.
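A sketch of this with the window size of 6 from the text:

```python
range_ds = tf.data.Dataset.range(100)
windows = range_ds.window(6, shift=1)   # a dataset whose elements are datasets

for sub_ds in windows.take(1):          # the first sub-dataset
    print(list(sub_ds.as_numpy_iterator()))   # [0, 1, 2, 3, 4, 5]
```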

Next, we want to demonstrate the use of windows to create sliding data. flat_map takes a dataset of datasets (windows) and flattens it into a single dataset. The new dataset outputs batches from the first sub-dataset (dataset1) first. Once it is exhausted, it moves to the next sub-dataset in windows.

In line 18 below, we collect 5 batches of samples from the flattened dataset. The first batch of samples comes from the first sub-dataset and the second batch from the second sub-dataset.

But in line 17 below, we configure the sub-dataset to have a batch size of 3 only. So, each sub-dataset produces 2 batches, and the diagram on the right shows the first 5 batches sampled from the flattened dataset. Set drop_remainder=True in line 17 if we don’t want the last batch of a sub-dataset to be smaller than 3.
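A hedged sketch of flattening the windows with flat_map while batching each sub-dataset with a batch size of 3; if the window size were not a multiple of 3, drop_remainder=True would discard the smaller final batch of each window.

```python
windows = tf.data.Dataset.range(100).window(6, shift=1)

# Turn each sub-dataset into batches of 3, then flatten everything into one dataset.
flat_ds = windows.flat_map(lambda w: w.batch(3))

for batch in flat_ds.take(5):
    print(batch.numpy())   # [0 1 2], [3 4 5], [1 2 3], [4 5 6], [2 3 4]
```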

As shown below, dataset.window also allows us to control the stride and the shift when sampling the data. “Shift” is the number of input elements by which the window moves forward in each iteration when creating a sub-dataset. “Stride” is the step between the input elements sampled within a sliding window (sub-dataset).
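A sketch showing shift and stride together; the values are chosen for illustration.

```python
ds = tf.data.Dataset.range(20)

# size=4 elements per window, windows start 2 apart (shift), and the elements
# inside a window are taken 2 apart (stride).
windows = ds.window(size=4, shift=2, stride=2, drop_remainder=True)

for batch in windows.flat_map(lambda w: w.batch(4)).take(3):
    print(batch.numpy())   # [0 2 4 6], [2 4 6 8], [4 6 8 10]
```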

Credits and References

TensorFlow guide

Deep Learning