Chapter 13 Loading and Preprocessing Data with TensorFlow
Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, 2nd Edition by A. Géron
So far we have used only datasets that fit in memory, but Deep Learning systems are often trained on very large datasets that will not fit in RAM.
Ingesting a large dataset and preprocessing it efficiently can be tricky to implement with other Deep Learning libraries, but TensorFlow makes it easy thanks to the Data API: you just create a dataset object, and tell it where to get the data and how to transform it.
TensorFlow takes care of all the implementation details, such as multithreading, queuing, batching, and prefetching.
Moreover, the Data API works seamlessly with tf.keras!
The Data API
The whole Data API revolves around the concept of a dataset: as you might suspect, this represents a sequence of data items.
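A minimal sketch of this idea: `tf.data.Dataset.from_tensor_slices()` builds a dataset from an in-memory tensor, yielding one item per slice (the data here is just a toy range, not from the book).

```python
import tensorflow as tf

# A dataset represents a sequence of data items; from_tensor_slices()
# builds one from an in-memory tensor, one item per slice.
X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
items = [item.numpy() for item in dataset]  # iterate like any sequence
```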
Chaining Transformations
Once you have a dataset, you can apply all sorts of transformations to it by calling its transformation methods.
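For example, each transformation method returns a new dataset, so calls can be chained (the specific repeat/batch combination below is just an illustration):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)     # items 0..9
# Each method returns a new dataset, so transformations chain naturally:
dataset = dataset.repeat(3).batch(7)    # repeat the items 3 times, then
                                        # group them into batches of 7
batches = [batch.numpy().tolist() for batch in dataset]
```

Note that the final batch holds only the 2 leftover items; pass `drop_remainder=True` to `batch()` to drop it.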
Shuffling the Data
As you know, Gradient Descent works best when the instances in the training set are independent and identically distributed (see Chapter 4).
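One way to approximate this with the Data API is the `shuffle()` method, which keeps a buffer of items and draws from it at random; a larger buffer gives a more thorough shuffle (buffer size and seed below are arbitrary):

```python
import tensorflow as tf

# shuffle() fills a buffer of `buffer_size` items, then repeatedly
# draws one at random and replaces it with the next item.
dataset = tf.data.Dataset.range(10).shuffle(buffer_size=5, seed=42)
shuffled = [int(item.numpy()) for item in dataset]
```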
Interleaving lines from multiple files
First, let's suppose that you've loaded the California housing dataset, shuffled it (unless it was already shuffled), and split it into a training set, a validation set, and a test set.
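With the set split across many CSV files, `interleave()` can read several files at once and mix their lines. The sketch below writes three tiny stand-in CSV files (hypothetical placeholders for the housing shards), then interleaves their lines:

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical stand-ins for the California housing CSV shards:
# three tiny files, each with a header line plus two data rows.
tmpdir = tempfile.mkdtemp()
filepaths = []
for i in range(3):
    path = os.path.join(tmpdir, f"part_{i}.csv")
    with open(path, "w") as f:
        f.write("x\n")              # header line to skip
        f.write(f"{i}a\n{i}b\n")
    filepaths.append(path)

# list_files() shuffles the file paths; interleave() then reads
# cycle_length files at a time, pulling one line from each in turn.
filepath_dataset = tf.data.Dataset.list_files(filepaths, seed=42)
dataset = filepath_dataset.interleave(
    lambda fp: tf.data.TextLineDataset(fp).skip(1),  # skip the header
    cycle_length=3)
lines = sorted(line.numpy().decode() for line in dataset)
```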
Preprocessing the Data
Let's implement a small function that will perform this preprocessing.
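A sketch of such a function, assuming 8 numeric features followed by the target in each CSV line; `X_mean` and `X_std` are placeholder statistics here (in practice they would be computed on the training set):

```python
import tensorflow as tf

n_inputs = 8  # number of features in the California housing data

# Placeholder per-feature statistics; real code would precompute the
# training set's means and standard deviations.
X_mean = tf.zeros(n_inputs)
X_std = tf.ones(n_inputs)

def preprocess(line):
    # record_defaults also tells decode_csv each field's type; the
    # empty-tensor default makes the last field (the target) required.
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])   # features as one 1-D tensor
    y = tf.stack(fields[-1:])   # target as a 1-D tensor of length 1
    return (x - X_mean) / X_std, y

x, y = preprocess(b"1,2,3,4,5,6,7,8,9")
```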
Putting Everything Together
Prefetching
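Calling `prefetch(1)` at the end of a pipeline tells it to always keep one batch ready in advance, so the CPU can prepare the next batch while the GPU works on the current one (the toy pipeline below is only for illustration):

```python
import tensorflow as tf

# prefetch(1) overlaps data preparation with training: while one batch
# is being consumed, the next one is already being produced.
dataset = tf.data.Dataset.range(10).batch(5).prefetch(1)
batches = [batch.numpy().tolist() for batch in dataset]
```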
Using the Dataset with tf.keras
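Since `tf.keras` accepts datasets directly, you can pass one to `fit()` instead of NumPy arrays. A minimal sketch with synthetic data and a one-layer model (both are illustrative, not from the book):

```python
import tensorflow as tf

# Synthetic (features, target) pairs, batched into a dataset.
X = tf.random.normal([64, 8], seed=42)
y = tf.reduce_sum(X, axis=1, keepdims=True)
train_set = tf.data.Dataset.from_tensor_slices((X, y)).batch(16)

model = tf.keras.Sequential([
    tf.keras.Input(shape=[8]),
    tf.keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="sgd")
# fit() consumes the dataset directly, one epoch per full pass.
history = model.fit(train_set, epochs=2, verbose=0)
```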
The TFRecord Format
The TFRecord format is TensorFlow's preferred format for storing large amounts of data and reading it efficiently.
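At its core a TFRecord file is simply a sequence of binary records. A minimal round trip with `tf.io.TFRecordWriter` and `tf.data.TFRecordDataset` (file path and record contents are arbitrary):

```python
import os
import tempfile
import tensorflow as tf

# Write two raw binary records to a TFRecord file...
path = os.path.join(tempfile.mkdtemp(), "my_data.tfrecord")
with tf.io.TFRecordWriter(path) as f:
    f.write(b"First record")
    f.write(b"Second record")

# ...then read them back as a dataset of byte strings.
dataset = tf.data.TFRecordDataset([path])
records = [item.numpy() for item in dataset]
```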
Compressed TFRecord Files
It can sometimes be useful to compress your TFRecord files, especially if they need to be loaded via a network connection.
You can create a compressed TFRecord file by setting the options argument:
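A sketch of the round trip: pass a `tf.io.TFRecordOptions` with GZIP compression when writing, and specify the same compression type when reading back.

```python
import os
import tempfile
import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "my_compressed.tfrecord")

# Write with GZIP compression via the options argument...
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(path, options) as f:
    f.write(b"Compressed record")

# ...and tell the reader which compression was used.
dataset = tf.data.TFRecordDataset([path], compression_type="GZIP")
records = [item.numpy() for item in dataset]
```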
A Brief Introduction to Protocol Buffers
Even though each record can use any binary format you want, TFRecord files usually contain serialized protocol buffers (also called protobufs).
TensorFlow Protobufs
The main protobuf typically used in a TFRecord file is the Example protobuf, which represents one instance in a dataset.
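An `Example` holds a `Features` map from feature names to values, each a list of byte strings, floats, or integers. A small illustration (the person data is made up):

```python
import tensorflow as tf
from tensorflow.train import BytesList, Int64List
from tensorflow.train import Feature, Features, Example

# An Example maps feature names to typed value lists.
person_example = Example(
    features=Features(
        feature={
            "name": Feature(bytes_list=BytesList(value=[b"Alice"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(
                value=[b"a@b.com", b"c@d.com"])),
        }))
serialized = person_example.SerializeToString()  # binary protobuf bytes
```

The serialized bytes are what you would write to a TFRecord file.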
Loading and Parsing Examples
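To parse a serialized `Example` you provide a feature description mapping each name to its type and shape. A minimal round trip (the toy record is illustrative):

```python
import tensorflow as tf

# Build and serialize a tiny Example...
example = tf.train.Example(features=tf.train.Features(feature={
    "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
    "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"Bob"])),
}))
serialized = example.SerializeToString()

# ...then parse it back using a feature description.
feature_description = {
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
}
parsed = tf.io.parse_single_example(serialized, feature_description)
```

To parse a whole batch at once, `tf.io.parse_example()` takes a batch of serialized strings with the same description.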
Handling Lists of Lists Using the SequenceExample Protobuf
Preprocessing the Input Features
Encoding Categorical Features Using One-Hot Vectors
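A sketch using a lookup table plus `tf.one_hot()`: the vocabulary lists the known `ocean_proximity` categories, and out-of-vocabulary buckets catch unknown ones (the `"DESERT"` category below is a made-up unknown value):

```python
import tensorflow as tf

# Known categories plus oov buckets for unseen ones.
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

# Map category strings to indices, then indices to one-hot vectors.
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND"])
cat_indices = table.lookup(categories)  # "DESERT" falls in an oov bucket
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)
```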
Encoding Categorical Features Using Embeddings
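Instead of sparse one-hot vectors, an `Embedding` layer maps each category index to a trainable dense vector; the vectors are learned during training. A minimal sketch (dimensions chosen arbitrarily):

```python
import tensorflow as tf

# 7 possible indices (5 categories + 2 oov buckets), each mapped to a
# trainable 2-D vector initialized at random.
embedding = tf.keras.layers.Embedding(input_dim=7, output_dim=2)
vectors = embedding(tf.constant([3, 5, 1]))  # one vector per index
```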
Word Embeddings
Keras Preprocessing Layers
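One such layer available in recent TensorFlow versions is `Normalization`: its `adapt()` method learns each feature's mean and variance from a data sample, and the layer then standardizes inputs inside the model (the tiny data array is illustrative):

```python
import numpy as np
import tensorflow as tf

# adapt() computes per-feature statistics; calling the layer then
# standardizes inputs using those statistics.
norm_layer = tf.keras.layers.Normalization()
X_train = np.array([[1.], [2.], [3.]], dtype=np.float32)
norm_layer.adapt(X_train)
X_scaled = norm_layer(X_train)  # mean ~0, variance ~1
```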
TF Transform
The TensorFlow Datasets (TFDS) Project
Exercises