AI_ML_DL’s diary

A diary on artificial intelligence, machine learning, and deep learning

Chapter 13  Loading and Preprocessing Data with TensorFlow

Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, 2nd Edition, by A. Géron

 

So far we have used only datasets that fit in memory, but Deep Learning systems are often trained on very large datasets that will not fit in RAM.

Ingesting a large dataset and preprocessing it efficiently can be tricky to implement with other Deep Learning libraries, but TensorFlow makes it easy thanks to the Data API: you just create a dataset object, and tell it where to get the data and how to transform it.

TensorFlow takes care of all the implementation details, such as multithreading, queuing, batching, and prefetching.

Moreover, the Data API works seamlessly with tf.keras!

 

 

The Data API

The whole Data API revolves around the concept of a dataset: as you might suspect, this represents a sequence of data items.
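For example, a dataset can be built directly from a tensor that is already in memory. The snippet below is a minimal sketch along the lines of the book's running example:

import tensorflow as tf

X = tf.range(10)                                  # any data tensor held in memory
dataset = tf.data.Dataset.from_tensor_slices(X)   # one item per element of X

for item in dataset:
    print(item)                                   # tensors 0, 1, ..., 9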

 

Chaining Transformations

Once you have a dataset, you can apply all sorts of transformations to it by calling its transformation methods.
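Each transformation method returns a new dataset, so the calls can be chained. A small sketch on a toy dataset, again in the spirit of the book's example:

dataset = tf.data.Dataset.range(10)
dataset = dataset.repeat(3).batch(7)      # repeat the items 3 times, then group them into batches of 7
dataset = dataset.map(lambda x: x * 2)    # apply a function to every item (here, every batch)

for item in dataset:
    print(item)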

 

Shuffling the Data

As you know, Gradient Descent works best when the instances in the training set are independent and identically distributed (see Chapter 4).
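The shuffle() method draws items at random from a buffer that is gradually refilled from the source dataset, so the buffer size should be large enough for effective shuffling. A short sketch:

dataset = tf.data.Dataset.range(10).repeat(3)                  # items 0 to 9, three times
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(7)     # shuffle with a small buffer, then batch

for item in dataset:
    print(item)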

 

Interleaving lines from multiple files

First, let's suppose that you've loaded the California housing dataset, shuffled it (unless it was already shuffled), and split it into a training set, a validation set, and a test set.
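Assuming each split has been saved to several CSV files on disk, list_files() can shuffle the file paths and interleave() can read several files at once, pulling one line from each in turn. The file pattern below is hypothetical:

train_filepaths = "datasets/housing/my_train_*.csv"   # hypothetical path to the training CSV shards

filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),   # skip each file's header row
    cycle_length=n_readers)                                       # cycle through n_readers files in parallel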

 

Preprocessing the Data

Let's implement a small function that will perform this preprocessing.
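A sketch of such a function for the California housing CSV lines, assuming the feature means X_mean and standard deviations X_std were computed on the training set beforehand (for example with Scikit-Learn's StandardScaler):

n_inputs = 8   # number of input features in the California housing data

def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]   # default values; the last field (the target) is required
    fields = tf.io.decode_csv(line, record_defaults=defs)          # parse one CSV line into a list of scalar tensors
    x = tf.stack(fields[:-1])                                      # features
    y = tf.stack(fields[-1:])                                      # target
    return (x - X_mean) / X_std, y                                 # scale the features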

 

Putting Everything Together

 

 

Prefetching

 

 

Using the Dataset with tf.keras

 

 

The TFRecord Format

The TFRecord format is TensorFlow's preferred format for storing large amounts of data and reading it efficiently.
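A minimal sketch of writing a couple of raw byte strings to a TFRecord file and reading them back (the file name is arbitrary):

with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

dataset = tf.data.TFRecordDataset(["my_data.tfrecord"])
for item in dataset:
    print(item)   # the raw byte strings, in order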

 

Compressed TFRecord Files

It can sometimes be useful to compress your TFRecord files, especially if they need to be loaded via a network connection.

You can create a compressed TFRecord file by setting the options argument:
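For instance, with GZIP compression (the file name is arbitrary):

options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
    f.write(b"This record will be compressed")

# When reading, the compression type must be specified as well:
dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],
                                  compression_type="GZIP")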

 

A Brief Introduction to Protocol Buffers

Even though each record can use any binary format you want, TFRecord files usually contain serialized protocol buffers (also called protobufs).

 

TensorFlow Protobufs

The main protobuf typically used in a TFRecord file is the Example protobuf, which represents one instance in a dataset.
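As a sketch, an Example holding one hypothetical person record can be built from Feature objects and serialized into a TFRecord file:

from tensorflow.train import BytesList, Int64List
from tensorflow.train import Feature, Features, Example

person_example = Example(
    features=Features(
        feature={
            "name": Feature(bytes_list=BytesList(value=[b"Alice"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(value=[b"a@b.com",
                                                          b"c@d.com"])),
        }))

with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())   # one serialized Example per record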

 

Loading and Parsing Examples

 

 

Handling Lists of Lists Using the SequenceExample Protobuf

 

 

Preprocessing the Input Features

 

 

Encoding Categorical Features Using One-Hot Vectors

 

 

Encoding Categorical Features Using Embeddings

 

 

Word Embeddings

 

 

Keras Preprocessing Layers

 

 

TF Transform

 

 

The TensorFlow Datasets (TFDS) Project

 

 

Exercises

 

 

 

[Image: style=134 iteration=1]

[Image: style=134 iteration=20]

[Image: style=134 iteration=500]