Chapter 13 Loading and Preprocessing Data with TensorFlow
Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, 2nd Edition by A. Géron
So far we have used only datasets that fit in memory, but Deep Learning systems are often trained on very large datasets that will not fit in RAM.
Ingesting a large dataset and preprocessing it efficiently can be tricky to implement with other Deep Learning libraries, but TensorFlow makes it easy thanks to the Data API: you just create a dataset object, and tell it where to get the data and how to transform it.
TensorFlow takes care of all the implementation details, such as multithreading, queuing, batching, and prefetching.
Moreover, the Data API works seamlessly with tf.keras!
The Data API
The whole Data API revolves around the concept of a dataset: as you might suspect, this represents a sequence of data items.
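A minimal sketch of this idea: `tf.data.Dataset.from_tensor_slices()` builds a dataset from an in-memory tensor, yielding one item per slice (the data here is just a toy range, not from the book).

```python
import tensorflow as tf

# A dataset represents a sequence of data items; from_tensor_slices()
# builds one from an in-memory tensor, one item per slice.
X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
items = [item.numpy() for item in dataset]  # iterate like any sequence
```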
Chaining Transformations
Once you have a dataset, you can apply all sorts of transformations to it by calling its transformation methods.
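For example, each transformation method returns a new dataset, so calls can be chained (the specific repeat/batch combination below is just an illustration):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)     # items 0..9
# Each method returns a new dataset, so transformations chain naturally:
dataset = dataset.repeat(3).batch(7)    # repeat the items 3 times, then
                                        # group them into batches of 7
batches = [batch.numpy().tolist() for batch in dataset]
```

Note that the final batch holds only the 2 leftover items; pass `drop_remainder=True` to `batch()` to drop it.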
Shuffling the Data
As you know, Gradient Descent works best when the instances in the training set are independent and identically distributed (see Chapter 4).
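One way to approximate this with the Data API is the `shuffle()` method, which keeps a buffer of items and draws from it at random; a larger buffer gives a more thorough shuffle (buffer size and seed below are arbitrary):

```python
import tensorflow as tf

# shuffle() fills a buffer of `buffer_size` items, then repeatedly
# draws one at random and replaces it with the next item.
dataset = tf.data.Dataset.range(10).shuffle(buffer_size=5, seed=42)
shuffled = [int(item.numpy()) for item in dataset]
```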
Interleaving lines from multiple files
First, let's suppose that you've loaded the California housing dataset, shuffled it (unless it was already shuffled), and split it into a training set, a validation set, and a test set.
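With the set split across many CSV files, `interleave()` can read several files at once and mix their lines. The sketch below writes three tiny stand-in CSV files (hypothetical placeholders for the housing shards), then interleaves their lines:

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical stand-ins for the California housing CSV shards:
# three tiny files, each with a header line plus two data rows.
tmpdir = tempfile.mkdtemp()
filepaths = []
for i in range(3):
    path = os.path.join(tmpdir, f"part_{i}.csv")
    with open(path, "w") as f:
        f.write("x\n")              # header line to skip
        f.write(f"{i}a\n{i}b\n")
    filepaths.append(path)

# list_files() shuffles the file paths; interleave() then reads
# cycle_length files at a time, pulling one line from each in turn.
filepath_dataset = tf.data.Dataset.list_files(filepaths, seed=42)
dataset = filepath_dataset.interleave(
    lambda fp: tf.data.TextLineDataset(fp).skip(1),  # skip the header
    cycle_length=3)
lines = sorted(line.numpy().decode() for line in dataset)
```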
Preprocessing the Data
Let's implement a small function that will perform this preprocessing.
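A sketch of such a function, assuming 8 numeric features followed by the target in each CSV line; `X_mean` and `X_std` are placeholder statistics here (in practice they would be computed on the training set):

```python
import tensorflow as tf

n_inputs = 8  # number of features in the California housing data

# Placeholder per-feature statistics; real code would precompute the
# training set's means and standard deviations.
X_mean = tf.zeros(n_inputs)
X_std = tf.ones(n_inputs)

def preprocess(line):
    # record_defaults also tells decode_csv each field's type; the
    # empty-tensor default makes the last field (the target) required.
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])   # features as one 1-D tensor
    y = tf.stack(fields[-1:])   # target as a 1-D tensor of length 1
    return (x - X_mean) / X_std, y

x, y = preprocess(b"1,2,3,4,5,6,7,8,9")
```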
Putting Everything Together
Prefetching
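Calling `prefetch(1)` at the end of a pipeline tells it to always keep one batch ready in advance, so the CPU can prepare the next batch while the GPU works on the current one (the toy pipeline below is only for illustration):

```python
import tensorflow as tf

# prefetch(1) overlaps data preparation with training: while one batch
# is being consumed, the next one is already being produced.
dataset = tf.data.Dataset.range(10).batch(5).prefetch(1)
batches = [batch.numpy().tolist() for batch in dataset]
```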
Using the Dataset with tf.keras
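Since `tf.keras` accepts datasets directly, you can pass one to `fit()` instead of NumPy arrays. A minimal sketch with synthetic data and a one-layer model (both are illustrative, not from the book):

```python
import tensorflow as tf

# Synthetic (features, target) pairs, batched into a dataset.
X = tf.random.normal([64, 8], seed=42)
y = tf.reduce_sum(X, axis=1, keepdims=True)
train_set = tf.data.Dataset.from_tensor_slices((X, y)).batch(16)

model = tf.keras.Sequential([
    tf.keras.Input(shape=[8]),
    tf.keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="sgd")
# fit() consumes the dataset directly, one epoch per full pass.
history = model.fit(train_set, epochs=2, verbose=0)
```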
The TFRecord Format
The TFRecord format is TensorFlow's preferred format for storing large amounts of data and reading it efficiently.
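At its core a TFRecord file is simply a sequence of binary records. A minimal round trip with `tf.io.TFRecordWriter` and `tf.data.TFRecordDataset` (file path and record contents are arbitrary):

```python
import os
import tempfile
import tensorflow as tf

# Write two raw binary records to a TFRecord file...
path = os.path.join(tempfile.mkdtemp(), "my_data.tfrecord")
with tf.io.TFRecordWriter(path) as f:
    f.write(b"First record")
    f.write(b"Second record")

# ...then read them back as a dataset of byte strings.
dataset = tf.data.TFRecordDataset([path])
records = [item.numpy() for item in dataset]
```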
Compressed TFRecord Files
It can sometimes be useful to compress your TFRecord files, especially if they need to be loaded via a network connection.
You can create a compressed TFRecord file by setting the options argument:
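A sketch of the round trip: pass a `tf.io.TFRecordOptions` with GZIP compression when writing, and specify the same compression type when reading back.

```python
import os
import tempfile
import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "my_compressed.tfrecord")

# Write with GZIP compression via the options argument...
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(path, options) as f:
    f.write(b"Compressed record")

# ...and tell the reader which compression was used.
dataset = tf.data.TFRecordDataset([path], compression_type="GZIP")
records = [item.numpy() for item in dataset]
```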
A Brief Introduction to Protocol Buffers
Even though each record can use any binary format you want, TFRecord files usually contain serialized protocol buffers (also called protobufs).
TensorFlow Protobufs
The main protobuf typically used in a TFRecord file is the Example protobuf, which represents one instance in a dataset.
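An `Example` holds a `Features` map from feature names to values, each a list of byte strings, floats, or integers. A small illustration (the person data is made up):

```python
import tensorflow as tf
from tensorflow.train import BytesList, Int64List
from tensorflow.train import Feature, Features, Example

# An Example maps feature names to typed value lists.
person_example = Example(
    features=Features(
        feature={
            "name": Feature(bytes_list=BytesList(value=[b"Alice"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(
                value=[b"a@b.com", b"c@d.com"])),
        }))
serialized = person_example.SerializeToString()  # binary protobuf bytes
```

The serialized bytes are what you would write to a TFRecord file.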
Loading and Parsing Examples
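To parse a serialized `Example` you provide a feature description mapping each name to its type and shape. A minimal round trip (the toy record is illustrative):

```python
import tensorflow as tf

# Build and serialize a tiny Example...
example = tf.train.Example(features=tf.train.Features(feature={
    "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
    "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"Bob"])),
}))
serialized = example.SerializeToString()

# ...then parse it back using a feature description.
feature_description = {
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
}
parsed = tf.io.parse_single_example(serialized, feature_description)
```

To parse a whole batch at once, `tf.io.parse_example()` takes a batch of serialized strings with the same description.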
Handling Lists of Lists Using the SequenceExample Protobuf
Preprocessing the Input Features
Encoding Categorical Features Using One-Hot Vectors
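A sketch using a lookup table plus `tf.one_hot()`: the vocabulary lists the known `ocean_proximity` categories, and out-of-vocabulary buckets catch unknown ones (the `"DESERT"` category below is a made-up unknown value):

```python
import tensorflow as tf

# Known categories plus oov buckets for unseen ones.
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

# Map category strings to indices, then indices to one-hot vectors.
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND"])
cat_indices = table.lookup(categories)  # "DESERT" falls in an oov bucket
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)
```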
Encoding Categorical Features Using Embeddings
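Instead of sparse one-hot vectors, an `Embedding` layer maps each category index to a trainable dense vector; the vectors are learned during training. A minimal sketch (dimensions chosen arbitrarily):

```python
import tensorflow as tf

# 7 possible indices (5 categories + 2 oov buckets), each mapped to a
# trainable 2-D vector initialized at random.
embedding = tf.keras.layers.Embedding(input_dim=7, output_dim=2)
vectors = embedding(tf.constant([3, 5, 1]))  # one vector per index
```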
Word Embeddings
Keras Preprocessing Layers
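One such layer available in recent TensorFlow versions is `Normalization`: its `adapt()` method learns each feature's mean and variance from a data sample, and the layer then standardizes inputs inside the model (the tiny data array is illustrative):

```python
import numpy as np
import tensorflow as tf

# adapt() computes per-feature statistics; calling the layer then
# standardizes inputs using those statistics.
norm_layer = tf.keras.layers.Normalization()
X_train = np.array([[1.], [2.], [3.]], dtype=np.float32)
norm_layer.adapt(X_train)
X_scaled = norm_layer(X_train)  # mean ~0, variance ~1
```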
TF Transform
The TensorFlow Datasets (TFDS) Project
Exercises