Chapter 15 Processing Sequences Using RNNs and CNNs
Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow 2nd Edition by A. Geron
RNNs are not the only types of neural networks capable of handling sequential data:
for small sequences, a regular dense network can do the trick;
and for very long sequences, such as audio samples or text,
convolutional neural networks can actually work quite well too.
We will discuss both of these possibilities, and we will finish this chapter by implementing a WaveNet: this is a CNN architecture capable of handling sequences of tens of thousands of time steps.
In Chapter 16, we will continue to explore RNNs and see how to use them for natural language processing, along with more recent architectures based on attention mechanisms.
Using 1D convolutional layers to process sequences
In Chapter 14, we saw that a 2D convolutional layer works by sliding several fairly small kernels (or filters) across an image, producing multiple 2D feature maps (one per kernel).
Similarly, a 1D convolutional layer slides several kernels across a sequence, producing a 1D feature map per kernel.
Each kernel will learn to detect a single very short sequential pattern (no longer than the kernel size).
If you use 10 kernels, then the layer's output will be composed of 10 1-dimensional sequences (all of the same length), or equivalently you can view this output as a single 10-dimensional sequence.
This means that you can build a neural network composed of a mix of recurrent layers and 1D convolutional layers (or even 1D pooling layers).
If you use a 1D convolutional layer with a stride of 1 and "same" padding, then the output sequence will have the same length as the input sequence.
But if you use "valid" padding or a stride greater than 1, then the output sequence will be shorter than the input sequence, so make sure you adjust the targets accordingly.
For example, the following model is the same as earlier, except it starts with a 1D convolutional layer that downsamples the input sequence by a factor of 2, using a stride of 2.
The kernel size is larger than the stride, so all inputs will be used to compute the layer's output, and therefore the model can learn to preserve the useful information, dropping only the unimportant details.
By shortening the sequences, the convolutional layer may help the GRU layers detect longer patterns.
Note that we must also crop off the first three time steps in the targets (since the kernel's size is 4, the first output of the convolutional layer will be based on the input time steps 0 to 3), and downsample the targets by a factor of 2:
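The length bookkeeping described above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, not the model code: the helper name conv1d_output_length and the 50-step toy sequence are assumptions made for illustration.

```python
import math

def conv1d_output_length(n_steps, kernel_size, stride=1, padding="valid"):
    """Number of output time steps of a 1D convolution (hypothetical helper)."""
    if padding == "same":
        # "same" padding pads the input, so only the stride shortens it.
        return math.ceil(n_steps / stride)
    if padding == "valid":
        # "valid" padding keeps only the windows that fit inside the input.
        return (n_steps - kernel_size) // stride + 1
    raise ValueError(f"unsupported padding: {padding!r}")

n_steps = 50                    # toy sequence length (an assumption)
targets = list(range(n_steps))  # stand-in targets: one value per time step

out_len = conv1d_output_length(n_steps, kernel_size=4, stride=2, padding="valid")

# Crop the first three steps and keep every second one, so that target i
# lines up with the input window that ends at time step 3 + 2 * i.
cropped_targets = targets[3::2]

print(out_len, len(cropped_targets), cropped_targets[:3])
```

With a kernel of size 4 and a stride of 2, the cropped targets and the layer's output end up with the same number of time steps, which is what the slicing [3::2] is meant to guarantee.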
WaveNet
In a 2016 paper, Aaron van den Oord and other DeepMind researchers introduced an architecture called WaveNet.
They stacked 1D convolutional layers, doubling the dilation rate (how spread apart each neuron's inputs are) at each layer:
the first convolutional layer gets a glimpse of just two time steps at a time, while the next one sees four time steps (its receptive field is four time steps long), the next one sees eight time steps, and so on.
This way, the lower layers learn short-term patterns, while the higher layers learn long-term patterns.
Thanks to the doubling dilation rate, the network can process extremely large sequences very efficiently.
In the WaveNet paper, the authors actually stacked convolutional layers with dilation rates of 1, 2, 4, 8, ..., 256, 512, then they stacked another group of 10 identical layers (also with dilation rates 1, 2, 4, 8, ..., 256, 512), then again another identical group of 10 layers.
They justified this architecture by pointing out that a single stack of 10 convolution layers with these dilation rates will act like a super-efficient convolutional layer with a kernel of size 1,024 (except way faster, more powerful, and using significantly fewer parameters), which is why they stacked 3 such blocks.
They also left-padded the input sequences with a number of zeros equal to the dilation rate before every layer, to preserve the same sequence length throughout the network.
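The receptive-field figures quoted above (2, 4, 8 time steps, and 1,024 for a block of 10 layers) can be reproduced with a short calculation. This sketch assumes a kernel size of 2 for every layer, which is what "a glimpse of just two time steps" suggests; the function name is illustrative.

```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field (in time steps) of a stack of dilated 1D convolutions."""
    field = 1
    for rate in dilation_rates:
        # Each layer widens the field by (kernel_size - 1) * dilation rate.
        field += (kernel_size - 1) * rate
    return field

# Receptive field after the first one, two, and three layers (rates 1, 2, 4).
first_layers = [receptive_field(2, [2 ** i for i in range(n)]) for n in (1, 2, 3)]

# One full block of 10 layers with dilation rates 1, 2, 4, ..., 512.
one_block = receptive_field(2, [2 ** i for i in range(10)])

print(first_layers, one_block)
```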
Here is how to implement a simplified WaveNet to tackle the same sequences as earlier:
This Sequential model starts with an explicit input layer (this is simpler than trying to set input_shape only on the first layer), then continues with a 1D convolutional layer using "causal" padding:
this ensures that the convolutional layer does not peek into the future when making predictions (it is equivalent to padding the inputs with the right amount of zeros on the left and using "valid" padding).
We then add similar pairs of layers using growing dilation rates: 1, 2, 4, 8, and again 1, 2, 4, 8.
Finally, we add the output layer: a convolutional layer with 10 filters of size 1 and without any activation function.
Thanks to the padding layers, every convolutional layer outputs a sequence of the same length as the input sequences, so the targets we use during training can be the full sequences:
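To make the effect of "causal" padding concrete, here is a plain-Python sketch of a single-channel dilated convolution that pads with zeros on the left only. It is not the Keras implementation; the function name and the toy weights are assumptions, and it only illustrates that the output keeps the input length and never depends on future time steps.

```python
def causal_conv1d(sequence, kernel, dilation=1):
    """Dilated 1D convolution with "causal" padding, for a single channel."""
    pad = (len(kernel) - 1) * dilation
    padded = [0.0] * pad + list(sequence)  # zeros on the left only
    outputs = []
    for t in range(len(sequence)):
        # Output t combines inputs t, t - dilation, t - 2 * dilation, ...
        # kernel[-1] weighs the current step, kernel[0] the oldest one.
        total = 0.0
        for j, weight in enumerate(kernel):
            total += weight * padded[t + j * dilation]
        outputs.append(total)
    return outputs

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [1.0, 1.0]  # toy weights: sums two time steps
out = causal_conv1d(series, kernel, dilation=2)

# Changing the last input must not affect any earlier output.
changed = causal_conv1d(series[:-1] + [100.0], kernel, dilation=2)

print(out)
print(changed)
```

With a kernel of size 2, the number of zeros added on the left equals the dilation rate, which matches the left-padding described for the WaveNet paper.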
Chapter 14 Deep Computer Vision Using Convolutional Neural Networks
In this chapter we will explore where CNNs came from, what their building blocks look like, and how to implement them using TensorFlow and Keras.
Then we will discuss some of the best CNN architectures, as well as other visual tasks, including object detection (classifying multiple objects in an image and placing bounding boxes around them) and semantic segmentation (classifying each pixel according to the class of the object it belongs to).
The Architecture of the Visual Cortex
David H. Hubel and Torsten Wiesel performed a series of experiments on cats in 1958 and 1959 (and a few years later on monkeys), giving crucial insights into the structure of the visual cortex (the authors received the Nobel Prize in Physiology or Medicine in 1981 for this work).
In particular, they showed that many neurons in the visual cortex have a small local receptive field, meaning they react only to visual stimuli located in a limited region of the visual field (see Figure 14-1, in which the local receptive fields of five neurons are represented by dashed circles).
The receptive fields of different neurons may overlap, and together they tile the whole visual field.
Moreover, the authors showed that some neurons react only to images of horizontal lines, while others react only to lines with different orientations (two neurons may have the same receptive field but react to different line orientations).
They also noticed that some neurons have larger receptive fields, and they react to more complex patterns that are combinations of the lower-level patterns.
These observations led to the idea that the higher-level neurons are based on the outputs of neighboring lower-level neurons (in Figure 14-1, notice that each neuron is connected only to a few neurons from the previous layer).
This powerful architecture is able to detect all sorts of complex patterns in any area of the visual field.
Figure 14-1. Biological neurons in the visual cortex respond to specific patterns in small regions of the visual field called receptive fields; as the visual signal makes its way through consecutive brain modules, neurons respond to more complex patterns in larger receptive fields.
These studies of the visual cortex inspired the neocognitron, introduced in 1980, which gradually evolved into what we call convolutional neural networks.
(Kunihiko Fukushima, "Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position," Biological Cybernetics 36 (1980): 193-202.)
<Addendum>
The author, Kunihiko Fukushima, is alive and well and, now past 80, appears to be continuing his research energetically.
The 2019 IEICE invited paper "Recent advances in the deep CNN neocognitron" appears to be a culmination of his research so far; it cites 13 of his own papers published over the 40 years from 1979 to 2018.
Chapter 13 Loading and Preprocessing Data with TensorFlow
So far we have used only datasets that fit in memory, but Deep Learning systems are often trained on very large datasets that will not fit in RAM.
Ingesting a large dataset and preprocessing it efficiently can be tricky to implement with other Deep Learning libraries, but TensorFlow makes it easy thanks to the Data API: you just create a dataset object, and tell it where to get the data and how to transform it.
TensorFlow takes care of all the implementation details, such as multithreading, queuing, batching, and prefetching.
Moreover, the Data API works seamlessly with tf.keras!
The whole Data API revolves around the concept of a dataset: as you might suspect, this represents a sequence of data items.
Chaining Transformations
Once you have a dataset, you can apply all sorts of transformations to it by calling its transformation methods.
Shuffling the Data
As you know, Gradient Descent works best when the instances in the training set are independent and identically distributed (see Chapter 4).
Interleaving lines from multiple files
First, let's suppose that you've loaded the California housing dataset, shuffled it (unless it was already shuffled), and split it into a training set, a validation set, and a test set.
Preprocessing the Data
Let's implement a small function that will perform this preprocessing.
Putting Everything Together
Prefetching
Using the Dataset with tf.keras
The TFRecord Format
The TFRecord format is TensorFlow's preferred format for storing large amounts of data and reading it efficiently.
Compressed TFRecord Files
It can sometimes be useful to compress your TFRecord files, especially if they need to be loaded via a network connection.
You can create a compressed TFRecord file by setting the options argument:
A Brief Introduction to Protocol Buffers
Even though each record can use any binary format you want, TFRecord files usually contain serialized protocol buffers (also called protobufs).
TensorFlow Protobufs
The main protobuf typically used in a TFRecord file is the Example protobuf, which represents one instance in a dataset.
Loading and Parsing Examples
Handling Lists of Lists Using the SequenceExample Protobuf
Preprocessing the Input Features
Encoding Categorical Features Using One-Hot Vectors
Chapter 12 Custom Models and Training with TensorFlow
Up until now, we've used only TensorFlow's high-level API, tf.keras, but it already got us pretty far: we built various neural network architectures, including regression and classification nets, Wide & Deep nets, and self-normalizing nets, using all sorts of techniques, such as Batch Normalization, dropout, and learning rate schedules.
In fact, 95% of the use cases you will encounter will not require anything other than tf.keras (and tf.data; see Chapter 13). But it's time to dive deeper into TensorFlow and take a look at its lower-level Python API. This will be useful when you need extra control to write custom loss functions, custom metrics, layers, models, initializers, regularizers, weight constraints, and more. You may even need to fully control the training loop itself, for example to apply special transformations or constraints to the gradients (beyond just clipping them) or to use multiple optimizers for different parts of the network.
We will cover all these cases in this chapter, and we will also look at how you can boost your custom models and training algorithms using TensorFlow's automatic graph generation feature.
But first, let's take a quick tour of TensorFlow.
A Quick Tour of TensorFlow
・Its core is very similar to NumPy, but with GPU support.
・It supports distributed computing (across multiple devices and servers).
・It includes a kind of just-in-time (JIT) compiler that allows it to optimize computations for speed and memory usage. It works by extracting the computation graph from a Python function, then optimizing it (e.g., by pruning unused nodes), and finally running it efficiently (e.g., by automatically running independent operations in parallel).
・Computation graphs can be exported to a portable format, so you can train a TensorFlow model in one environment (e.g., using Python on Linux) and run it in another (e.g., using Java on an Android device).
・It implements autodiff (see Chapter 10 and Appendix D) and provides some excellent optimizers, such as RMSProp and Nadam (see Chapter 11), so you can easily minimize all sorts of loss functions.
TensorFlow runs not only on Windows, Linux, and macOS, but also on mobile devices (using TensorFlow Lite), including both iOS and Android (see Chapter 19). If you do not want to use the Python API, there are C++, Java, Go, and Swift APIs. There is even a JavaScript implementation called TensorFlow.js that makes it possible to run your models directly in your browser.
Note that writing t + 10 is equivalent to calling tf.add(t, 10) (indeed, Python calls the magic method t.__add__(10), which just calls tf.add(t, 10)). Other operators like - and * are also supported. The @ operator was added in Python 3.5, for matrix multiplication: it is equivalent to calling the tf.matmul( ) function.
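The way Python routes + and @ to magic methods can be shown without TensorFlow. The ToyTensor class below is a hypothetical stand-in written for this note, not TensorFlow's implementation; toy_add and toy_matmul play the role that tf.add( ) and tf.matmul( ) play for real tensors.

```python
class ToyTensor:
    """Hypothetical stand-in for a tensor; not TensorFlow's implementation."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __add__(self, other):
        # Called for `t + other`; a real tensor would delegate to tf.add().
        return toy_add(self, other)

    def __matmul__(self, other):
        # Called for `t @ other`; a real tensor would delegate to tf.matmul().
        return toy_matmul(self, other)


def toy_add(t, scalar):
    """Add a scalar to every element."""
    return ToyTensor([[value + scalar for value in row] for row in t.rows])


def toy_matmul(a, b):
    """Plain matrix product of two ToyTensor objects."""
    columns = list(zip(*b.rows))
    return ToyTensor([[sum(x * y for x, y in zip(row, col)) for col in columns]
                      for row in a.rows])


t = ToyTensor([[1, 2, 3], [4, 5, 6]])
transposed = ToyTensor(zip(*t.rows))

print((t + 10).rows)
print((t @ transposed).rows)
```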
The Keras API has its own low-level API, located in keras.backend. It includes functions like square( ), exp( ), and sqrt( ). In tf.keras, these functions generally just call the corresponding TensorFlow operations. If you want to write code that will be portable to other Keras implementations, you should use these Keras functions. However, they only cover a subset of all functions available in TensorFlow, so in this book we will use the TensorFlow operations directly. Here is a simple example using keras.backend, which is commonly named K for short:
Tensors play nice with NumPy: you can create a tensor from a NumPy array, and vice versa. You can even apply TensorFlow operations to NumPy arrays and NumPy operations to tensors:
Notice that NumPy uses 64-bit precision by default, while TensorFlow uses 32-bit.
This is because 32-bit precision is generally more than enough for neural networks, plus it runs faster and uses less RAM.
So when you create a tensor from a NumPy array, make sure to set dtype=tf.float32.
Type Conversions
Variables
Other Data Structures
Customizing Models and Training Algorithms
Custom Loss Functions
Saving and Loading Models That Contain Custom Components
Custom Activation Functions, Initializers, Regularizers, and Constraints
Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow 2nd Edition by A. Geron
In Chapter 10 we introduced artificial neural networks and trained our first deep neural networks.
But they were shallow nets, with just a few hidden layers.
What if you need to tackle a complex problem, such as detecting hundreds of types of objects in high-resolution images?
You may need to train a much deeper DNN, perhaps with 10 layers or many more, each containing hundreds of neurons, linked by hundreds of thousands of connections.
・You may be faced with the tricky vanishing gradients problem or the related exploding gradients problem. This is when the gradients grow smaller and smaller, or larger and larger, when flowing backward through the DNN during training. Both of these problems make lower layers very hard to train (lower layers = layers close to the input).
・You might not have enough training data for such a large network, or it might be too costly to label.
・Training may be extremely slow.
・A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or if they are too noisy.
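The first problem in the list above can be illustrated with a deliberately crude model: assume that crossing each layer multiplies the gradient by a constant factor. This ignores the weights and the actual activations, so it is only a sketch; 0.25 is the largest slope of the logistic function, and 1.5 is an arbitrary factor above 1 chosen for the example.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Factor applied to a gradient after flowing back through n_layers layers,
    assuming (simplistically) the same multiplicative factor at every layer."""
    return per_layer_factor ** n_layers

for n_layers in (5, 10, 20):
    vanishing = gradient_scale(0.25, n_layers)  # factor below 1: shrinks
    exploding = gradient_scale(1.5, n_layers)   # factor above 1: grows
    print(n_layers, vanishing, exploding)
```

Even with only 10 layers, a factor of 0.25 per layer leaves less than a millionth of the original gradient for the layers closest to the input.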
Chapter 10 Introduction to Artificial Neural Networks with Keras
From Biological to Artificial Neurons
Biological Neurons
Logical Computations with Neurons
The Perceptron
Scikit-Learn provides a Perceptron class that implements a single-TLU (threshold logic unit) network. It can be used pretty much as you would expect - for example, on the iris dataset (introduced in Chapter 4):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris setosa, else 0 (np.int was removed from recent NumPy)
per_clf = Perceptron()
per_clf.fit(X, y)  # train the single-TLU network
Every layer except the output layer includes a bias neuron
and is fully connected to the next layer.
When an ANN contains a deep stack of hidden layers,
it is called a deep neural network (DNN).
The field of Deep Learning studies DNNs, and more generally models containing deep stacks of computations.
For many years researchers struggled to find a way to train MLPs, without success.
But in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper that introduced the backpropagation training algorithm, which is still used today.
In short, it is Gradient Descent (introduced in Chapter 4) using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network's error with regard to every single model parameter.
In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error.
Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution.
Let's run through this algorithm in a bit more detail:
・It handles one mini-batch at a time (for example, containing 32 instances each), and it goes through the full training set multiple times. Each pass is called an epoch.
・Each mini-batch is passed to the network's input layer, which sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the forward pass: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.
・Next, the algorithm measures the network's output error (i.e., it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error).
・Then it computes how much each output connection contributed to the error. This is done analytically by applying the chain rule (perhaps the most fundamental rule in calculus), which makes this step fast and precise.
・The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, working backward until the algorithm reaches the input layer. As explained earlier, this reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network (hence the name of the algorithm).
・Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
This algorithm is so important that it's worth summarizing it again:
for each training instance, the backpropagation algorithm first makes a prediction (forward pass) and measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally tweaks the connection weights to reduce the error (Gradient Descent step).
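The three phases summarized above (forward pass, reverse pass, Gradient Descent step) can be traced by hand on a tiny network. The sketch below is an illustration under simplifying assumptions: one input, one logistic hidden neuron, one linear output neuron, a squared-error loss, a single training instance instead of a mini-batch, and arbitrary starting weights. The gradients are checked against finite differences.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass; returns the prediction and the hidden activation."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)  # hidden neuron
    y_hat = w2 * h + b2       # linear output neuron
    return y_hat, h

def loss(params, x, y):
    y_hat, _ = forward(params, x)
    return 0.5 * (y_hat - y) ** 2

def backward(params, x, y):
    """Reverse pass: chain rule from the output back toward the input."""
    w1, b1, w2, b2 = params
    y_hat, h = forward(params, x)     # intermediate results are reused
    d_out = y_hat - y                 # d(loss) / d(y_hat)
    d_w2, d_b2 = d_out * h, d_out     # output connection and bias
    d_z = d_out * w2 * h * (1.0 - h)  # back through the logistic function
    d_w1, d_b1 = d_z * x, d_z         # hidden connection and bias
    return [d_w1, d_b1, d_w2, d_b2]

params = [0.5, -0.3, 0.8, 0.1]  # arbitrary starting weights (assumption)
x, y = 1.5, 2.0                 # a single toy training instance

# Compare the analytic gradients with central finite differences.
grads = backward(params, x, y)
eps = 1e-6
numeric = []
for i in range(len(params)):
    up, down = list(params), list(params)
    up[i] += eps
    down[i] -= eps
    numeric.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))

# Repeat: forward pass, reverse pass, Gradient Descent step.
learning_rate = 0.1
initial_loss = loss(params, x, y)
for _ in range(100):
    step_grads = backward(params, x, y)
    params = [p - learning_rate * g for p, g in zip(params, step_grads)]
final_loss = loss(params, x, y)

print(initial_loss, final_loss)
```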
rectified linear unit function: ReLU(z) = max(0, z)
The popular activation functions and their derivatives are represented in Figure 10-8.
But wait!
Why do we need activation functions in the first place?
Well, if you chain several linear functions, all you get is a linear transformation.
For example, if f(x) = 2x + 3 and g(x) = 5x - 1, then chaining these two linear functions gives you another linear function: f(g(x)) = 2(5x - 1) + 3 = 10x + 1.
So if you don't have some nonlinearity between layers, then even a deep stack of layers is equivalent to a single layer, and you can't solve very complex problems with that.
Conversely, a large enough DNN with nonlinear activations can theoretically approximate any continuous function.
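The algebra above is easy to verify numerically. The short sketch below chains the same f and g, compares the result with the single linear function 10x + 1, and then inserts a ReLU between them to show that the chain stops being linear; the sample inputs are arbitrary.

```python
def f(x):
    return 2 * x + 3

def g(x):
    return 5 * x - 1

def relu(z):
    return max(0, z)

xs = [-2, -1, 0, 1, 2]

chained = [f(g(x)) for x in xs]               # two linear "layers"
single = [10 * x + 1 for x in xs]             # one equivalent linear function
chained_with_relu = [f(relu(g(x))) for x in xs]  # nonlinearity in between

print(chained)
print(single)
print(chained_with_relu)
```

With the ReLU in place, the outputs are flat for the negative inputs and then rise, so no single linear function can reproduce them.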
Regression MLPs
First, MLPs can be used for regression tasks.
If you want to predict a single value (e.g., the price of a house, given many of its features), then you just need a single output neuron: its output is the predicted value.
For multivariate regression (i.e., to predict multiple values at once), you need one output neuron per output dimension. For example, to locate the center of an object in an image, you need to predict 2D coordinates, so you need two output neurons. If you also want to place a bounding box around the object, then you need two more numbers: the width and the height of the object. So, you end up with four output neurons.
In general, when building an MLP for regression, you do not want to use any activation function for the output neurons, so they are free to output any range of values.
If you want to guarantee that the output will always be positive, then you can use the ReLU activation function in the output layer.
Alternatively, you can use the softplus activation function, which is a smooth variant of ReLU: softplus(z) = log(1 + exp(z)). It is close to 0 when z is negative, and close to z when z is positive.
Finally, if you want to guarantee that the prediction will fall within a given range of values, then you can use the logistic function or the hyperbolic tangent, and then scale the labels to the appropriate range: 0 to 1 for the logistic function and -1 to 1 for the hyperbolic tangent.
The loss function to use during training is typically the mean squared error, but if you have a lot of outliers in the training set, you may prefer to use the mean absolute error instead. Alternatively, you can use the Huber loss, which is a combination of both.
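Both softplus and the Huber loss are short enough to write out. The sketch below is a plain-Python illustration for single values, not the Keras implementation; the threshold delta=1.0 is an assumption chosen for the example.

```python
import math

def softplus(z):
    """log(1 + exp(z)), written in a form that does not overflow for large z."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def huber(error, delta=1.0):
    """Huber loss for one error: quadratic (like MSE) for small errors,
    linear (like MAE) beyond the threshold delta."""
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)

for z in (-20.0, 0.0, 20.0):
    print(z, softplus(z))

for error in (0.5, 3.0, -3.0):
    print(error, huber(error))
```

Because large errors are penalized linearly rather than quadratically, a few outliers pull the Huber loss far less than they would pull the mean squared error.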
Hyperparameter | Typical value
# input neurons | One per input feature (e.g., 28 x 28 = 784 for MNIST)
# hidden layers | Depends on the problem, but typically 1 to 5
# neurons per hidden layer | Depends on the problem, but typically 10 to 100
# output neurons | 1 per prediction dimension
Hidden activation | ReLU (or SELU, see Chapter 11)
Output activation | None, or ReLU/softplus (if positive outputs) or logistic/tanh (if bounded outputs)
Loss function | MSE or MAE/Huber (if outliers)
Classification MLPs
MLPs can also be used for classification tasks.
For a binary classification problem, you just need a single output neuron using the logistic activation function: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class.
The estimated probability of the negative class is equal to one minus that number.
MLPs can also easily handle multilabel binary classification tasks (see Chapter 3).
For example, you could have an email classification system that predicts whether each incoming email is ham or spam, and simultaneously predicts whether it is an urgent or nonurgent email.
In this case, you would need two output neurons, both using the logistic activation function: the first would output the probability that the email is spam, and the second would output the probability that it is urgent.
If each instance can belong only to a single class, out of three or more possible classes (e.g., classes 0 through 9 for digit image classification), then you need to have one output neuron per class, and you should use the softmax activation function for the whole output layer.
The softmax function (introduced in Chapter 4) will ensure that all the estimated probabilities are between 0 and 1 and that they add up to 1 (which is required if the classes are exclusive).
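The two properties claimed for softmax (every output between 0 and 1, and the outputs summing to 1) can be checked directly. This is a minimal plain-Python sketch with made-up scores, not the Keras implementation.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that add up to 1."""
    shift = max(logits)  # subtracting the max avoids overflow; result unchanged
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

probas = softmax([2.0, 1.0, 0.1])  # toy scores for three exclusive classes
print(probas, sum(probas))
```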
Hyperparameter | Binary classification | Multilabel binary classification | Multiclass classification
Input and hidden layers | Same as regression | Same as regression | Same as regression
# output neurons | 1 | 1 per label | 1 per class
Output layer activation | Logistic | Logistic | Softmax
Loss function | Cross entropy | Cross entropy | Cross entropy
Implementing MLPs with Keras
Installing TensorFlow 2
Building an Image Classifier Using the Sequential API
First, we need to load a dataset. In this chapter we will tackle Fashion MNIST, which is a drop-in replacement of MNIST (introduced in Chapter 3). It has the exact same format as MNIST (70,000 grayscale images of 28 x 28 pixels each, with 10 classes), but the images represent fashion items rather than handwritten digits, so each class is more diverse, and the problem turns out to be significantly more challenging than MNIST. For example, a simple linear model reaches about 92% accuracy on MNIST, but only about 83% on Fashion MNIST.
When loading MNIST or Fashion MNIST using Keras rather than Scikit-Learn, one important difference is that every image is represented as a 28 x 28 array rather than a 1D array of size 784. Moreover, the pixel intensities are represented as integers (from 0 to 255) rather than floats (from 0.0 to 255.0).
X_train_full.shape
(60000, 28, 28)
X_train_full.dtype
dtype('uint8')
Note that the dataset is already split into a training set and a test set, but there is no validation set, so we'll create one now.
Additionally, since we are going to train the neural network using Gradient Descent, we must scale the input features.
Note: always check the scaling of the input data!
For simplicity, we'll scale the pixel intensities down to the 0-1 range by dividing them by 255.0 (this also converts them to floats):
With MNIST, when the label is equal to 5, it means that the image represents the handwritten digit 5. Easy. For Fashion MNIST, however, we need the list of class names to know what we are dealing with:
・The first line creates a sequential model. This is the simplest kind of Keras model for neural networks that are just composed of a single stack of layers connected sequentially. This is called the Sequential API.
・Next, we build the first layer and add it to the model. It is a Flatten layer whose role is to convert each input image into a 1D array: if it receives input data X, it computes X.reshape(-1, 28*28).
This layer does not have any parameters; it is just there to do some simple preprocessing.
Since it is the first layer in the model, you should specify the input_shape, which doesn't include the batch size, only the shape of the instances.
Alternatively, you could add a keras.layers.InputLayer as the first layer, setting input_shape=[28, 28].
・Next we add a Dense hidden layer with 300 neurons.
It will use the ReLU activation function.
Each Dense layer manages its own weight matrix, containing all the connection weights between the neurons and their inputs.
It also manages a vector of bias terms (one per neuron).
When it receives some input data, it computes Equation 10-2.
Equation 10-2: h_{W,b}(X) = \phi(XW + b)
where W is the weight matrix, b is the bias vector, X is the matrix of input features, and \phi is the activation function.
・Then we add a second Dense hidden layer with 100 neurons, also using the ReLU activation function.
・Finally, we add a Dense output layer with 10 neurons (one per class), using the softmax activation function (because the classes are exclusive).
Specifying activation="relu" is equivalent to specifying activation=keras.activations.relu. Other activation functions are available in the keras.activations package.
Note: the following form also works.
Instead of adding the layers one by one as we just did, you can pass a list of layers when creating the Sequential model.
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),          # 28 x 28 image -> 784 values
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")         # one output per class
])
The model's summary( ) method displays all the model's layers, including each layer's name (which is automatically generated unless you set it when creating the layer), its output shape (None means the batch size can be anything), and its number of parameters.
The summary ends with the total number of parameters, including trainable and non-trainable parameters.
Here we only have trainable parameters (we will see examples of non-trainable parameters in Chapter 11).
Note that Dense layers often have a lot of parameters.
For example, the first hidden layer has 784 x 300 connection weights, plus 300 bias terms, which adds up to 235,500 parameters!
This gives the model quite a lot of flexibility to fit the training data, but it also means that the model runs the risk of overfitting, especially when you do not have a lot of training data.
We will come back to this later.
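The parameter count quoted above follows from one weight per input-neuron pair plus one bias per neuron. As a quick check, the sketch below applies that rule to the three Dense layers of this model (784 inputs, then 300, 100, and 10 neurons); the helper name is illustrative.

```python
def dense_param_count(n_inputs, n_neurons):
    """Connection weights (n_inputs x n_neurons) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

# Flattened 28 x 28 input, two hidden layers, and the output layer.
layer_sizes = [28 * 28, 300, 100, 10]
counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(counts, sum(counts))
```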
You can easily get a model's list of layers, fetch a layer by its index, or fetch it by name:
model.layers
[<tensorflow.python.keras.layers.core.Flatten at 0x12c15d588>, <tensorflow.python.keras.layers.core.Dense at 0x149938fd0>, <tensorflow.python.keras.layers.core.Dense at 0x149b198d0>, <tensorflow.python.keras.layers.core.Dense at 0x149b19be0>]
hidden1 = model.layers[1]
hidden1.name
'dense'
model.get_layer('dense') is hidden1
True
All the parameters of a layer can be accessed using its get_weights( ) and set_weights( ) methods.
For a Dense layer, this includes both the connection weights and the bias terms:
Notice that the Dense layer initialized the connection weights randomly (which is needed to break symmetry, as we discussed earlier), and the biases were initialized to zeros, which is fine.
If you ever want to use a different initialization method, you can set kernel_initializer (kernel is another name for the matrix of connection weights) or bias_initializer when creating the layer.
First, we use the "sparse_categorical_crossentropy" loss because we have sparse labels (i.e., for each instance, there is just a target class index, from 0 to 9 in this case), and the classes are exclusive.
If instead we had one target probability per class for each instance (such as one-hot vectors, e.g. [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.] to represent class 3), then we would need to use the "categorical_crossentropy" loss instead.
If we were doing binary classification (with one or more binary labels), then we would use the "sigmoid" (i.e., logistic) activation function in the output layer instead of the "softmax" activation function, and we would use the "binary_crossentropy" loss.
If you want to convert sparse labels (i.e., class indices) to one-hot vector labels, use the keras.utils.to_categorical( ) function.
To go the other way round, use the np.argmax( ) function with axis=1.
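The relationship between sparse labels and one-hot labels can be shown in plain Python. The helpers below are illustrative stand-ins written for this note (they are not the Keras or NumPy functions), and the predicted probabilities are made up. The sketch also shows that the sparse and the one-hot forms of the cross entropy give the same value for the same prediction.

```python
import math

def to_one_hot(class_indices, n_classes):
    """Stand-in for what keras.utils.to_categorical( ) produces."""
    return [[1.0 if k == index else 0.0 for k in range(n_classes)]
            for index in class_indices]

def argmax_rows(rows):
    """Stand-in for np.argmax( ) with axis=1: index of the largest value per row."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

def sparse_crossentropy(class_index, probas):
    return -math.log(probas[class_index])

def categorical_crossentropy(one_hot, probas):
    return -sum(t * math.log(p) for t, p in zip(one_hot, probas))

sparse_labels = [3, 0, 9]
one_hot_labels = to_one_hot(sparse_labels, n_classes=10)
recovered = argmax_rows(one_hot_labels)

probas = [0.1, 0.2, 0.7]  # made-up predicted probabilities for three classes
loss_sparse = sparse_crossentropy(2, probas)
loss_one_hot = categorical_crossentropy([0.0, 0.0, 1.0], probas)

print(one_hot_labels[0], recovered, loss_sparse, loss_one_hot)
```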
Regarding the optimizer, "sgd" means that we will train the model using simple Stochastic Gradient Descent.
In other words, Keras will perform the backpropagation algorithm described earlier (i.e., reverse-mode autodiff plus Gradient Descent).
We will discuss more efficient optimizers in Chapter 11 (they improve the Gradient Descent part, not the autodiff).
When using the SGD optimizer, it is important to tune the learning rate. So, you will generally want to use optimizer=keras.optimizers.SGD(lr=???) to set the learning rate, rather than optimizer="sgd", which defaults to lr=0.01.
Finally, since this is a classifier, it's useful to measure its "accuracy" during training and evaluation.
We pass it the input features (X_train) and the target classes (y_train), as well as the number of epochs to train (or else it would default to just 1, which would definitely not be enough to converge to a good solution).
We also pass a validation set: validation_data=(X_valid, y_valid) (this is optional).
Note: although the text calls it optional, a validation set is essential.
Keras will measure the loss and the extra metrics on this set at the end of each epoch, which is very useful to see how well the model really performs.
If the performance on the training set is much better than on the validation set, your model is probably overfitting the training set (or there is a bug, such as a data mismatch between the training set and the validation set).
At each epoch during training, Keras displays the number of instances processed so far (along with a progress bar), the mean training time per sample, and the loss and accuracy (or any other extra metrics you asked for) on both the training set and the validation set.
You can see that the training loss went down, which is a good sign, and the validation accuracy reached 89.26% after 30 epochs.
That's not too far from the training accuracy (91.92%), so there does not seem to be much overfitting going on.
If the training set was very skewed, with some classes being overrepresented and others underrepresented, it would be useful to set the class_weight argument when calling the fit( ) method, which would give a larger weight to underrepresented classes and a lower weight to overrepresented classes.
These weights would be used by Keras when computing the loss.
If you need per-instance weights, set the sample_weight argument (if both class_weight and sample_weight are provided, Keras multiplies them).
Per-instance weights could be useful if some instances were labeled by experts while others were labeled using a crowdsourcing platform:
you might want to give more weight to the former.
You can also provide sample weights (but not class weights) for the validation set by adding them as a third item in the validation_data tuple.
Note: give higher weights to instances whose labels are more reliable.
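The way the two kinds of weights combine can be sketched in plain Python. This is only an illustration of the idea that class weights and sample weights are multiplied before being applied to each instance's loss. The function name weighted_mean_loss is hypothetical, and dividing by the number of instances is an assumption of this sketch rather than a statement about Keras internals.

```python
def weighted_mean_loss(losses, labels, class_weight=None, sample_weight=None):
    """Average per-instance losses after applying class and sample weights."""
    total = 0.0
    for i, (loss, label) in enumerate(zip(losses, labels)):
        weight = 1.0
        if class_weight is not None:
            weight *= class_weight[label]   # weight attached to the instance's class
        if sample_weight is not None:
            weight *= sample_weight[i]      # weight attached to this specific instance
        total += weight * loss
    return total / len(losses)


losses = [0.2, 1.0, 0.4]             # made-up per-instance losses
labels = [0, 1, 0]                   # class 1 is underrepresented
class_weight = {0: 1.0, 1: 3.0}      # larger weight for the rare class
sample_weight = [1.0, 2.0, 0.5]      # e.g., more trust in the second label

unweighted = weighted_mean_loss(losses, labels)
weighted = weighted_mean_loss(losses, labels, class_weight, sample_weight)
```

Here the second instance belongs to the rare class and has a trusted label, so it dominates the weighted loss.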
The fit( ) method returns a History object containing the training parameters (history.params), the list of epochs it went through (history.epoch), and most importantly a dictionary (history.history) containing the loss and extra metrics it measured at the end of each epoch on the training set and on the validation set (if any).
If you use this dictionary to create a pandas DataFrame and call its plot( ) method, you get the learning curves shown in Figure 10-12:
plt.gca( ).set_ylim(0, 1) # set the vertical range to [0-1]
plt.show( )
You can see that both the training accuracy and the validation accuracy steadily increase during training, while the training loss and the validation loss decrease.
Good!
Moreover, the validation curves are close to the training curves, which means that there is not too much overfitting.
In this particular case, the model looks like it performed better on the validation set than on the training set at the beginning of training.
But that's not the case:
indeed, the validation error is computed at the end of each epoch, while the training error is computed using a running mean during each epoch.
So the training curve should be shifted by half an epoch to the left.
If you do that, you will see that the training and validation curves overlap almost perfectly at the beginning of the training.
The training set performance ends up beating the validation performance, as is generally the case when you train for long enough.
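A toy calculation makes the half-epoch shift concrete. The per-batch losses below are made-up numbers that decrease linearly within one epoch; this is an illustration of the argument, not output from Keras.

```python
# Hypothetical per-batch training losses for one epoch: 1.0, 0.9, ..., 0.0
batch_losses = [1.0 - 0.1 * i for i in range(11)]

# Mean over all batches of the epoch (what a running mean ends up reporting).
running_mean = sum(batch_losses) / len(batch_losses)

# Loss of the model as it stands at the end of the epoch.
end_of_epoch = batch_losses[-1]

# Loss of the model halfway through the epoch.
mid_epoch = batch_losses[len(batch_losses) // 2]
```

In this toy setting the running mean equals the mid-epoch loss and is worse than the end-of-epoch loss, which is why the training curve is best read as if it were measured half an epoch earlier.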
You can tell that the model has not quite converged yet, as the validation loss is still going down, so you should probably continue training.
It's as simple as calling the fit( ) method again, since Keras just continues training where it left off (you should be able to reach 89% validation accuracy).
If you are not satisfied with the performance of your model, you should go back and tune the hyperparameters.
The first one to check is the learning rate.
If that doesn't help, try another optimizer (and always retune the learning rate after changing any hyperparameter).
If the performance is still not great, then try tuning model hyperparameters such as the number of layers, the number of neurons per layer, and the types of activation functions to use for each hidden layer.
You can also try tuning other hyperparameters, such as the batch size (it can be set in the fit( ) method using the batch_size argument, which defaults to 32).
We will get back to hyperparameter tuning at the end of this chapter.
Once you are satisfied with your model's validation accuracy, you should evaluate it on the test set to estimate the generalization error before you deploy the model to production.
You can easily do this using the evaluate( ) method (it also supports several other arguments, such as batch_size and sample_weight; please check the documentation for more details):
As we saw in Chapter 2, it is common to get slightly lower performance on the test set than on the validation set, because the hyperparameters are tuned on the validation set, not the test set (however, in this example, we did not do any hyperparameter tuning, so the lower accuracy is just bad luck).
Remember to resist the temptation to tweak the hyperparameters on the test set, or else your estimate of the generalization error will be too optimistic.
Using the model to make predictions
Next, we can use the model's predict( ) method to make predictions on new instances.
Since we don't have actual new instances, we will just use the first three instances of the test set:
Building a Regression MLP Using the Sequential API
Let's switch to the California housing problem and tackle it using a regression neural network.
For simplicity, we will use Scikit-Learn's fetch_california_housing( ) function to load the data.
This dataset is simpler than the one used in Chapter 2, since it contains only numerical features (there is no ocean_proximity feature), and there are no missing values.
After loading the data, we split it into a training set, a validation set, and a test set, and we scale all the features.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
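The scaling step is not shown in these notes. One common approach is standardization using statistics computed on the training set only; that choice is an assumption here. The sketch below uses only the standard library, with hypothetical helper names fit_scaler and scale and made-up values for a single feature column.

```python
import statistics


def fit_scaler(column):
    """Return the mean and standard deviation of a training feature column."""
    return statistics.mean(column), statistics.pstdev(column)


def scale(column, mean, std):
    """Standardize a column with previously computed statistics."""
    return [(x - mean) / std for x in column]


train_feature = [2.0, 4.0, 6.0, 8.0]   # made-up values for one feature
test_feature = [5.0, 10.0]

mean, std = fit_scaler(train_feature)           # statistics come from the training data only
train_scaled = scale(train_feature, mean, std)
test_scaled = scale(test_feature, mean, std)    # the same statistics are reused
```

Reusing the training statistics for the validation and test sets keeps all three sets on the same scale.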
Using the Sequential API to build, train, evaluate, and use a regression MLP to make predictions is quite similar to what we did for classification.
The main differences are the fact that the output layer has a single neuron (since we only want to predict a single value) and uses no activation function, and the loss function is the mean squared error.
Since the dataset is quite noisy, we just use a single hidden layer with fewer neurons than before, to avoid overfitting:
X_new = X_test[:3] # pretend these are new instances
y_pred = model.predict(X_new)
As you can see, the Sequential API is quite easy to use.
However, although Sequential models are extremely common, it is sometimes useful to build neural networks with more complex topologies, or with multiple inputs or outputs.
One example of a nonsequential neural network is a Wide & Deep neural network.
This neural network architecture was introduced in a 2016 paper by Heng-Tze Cheng et al.
It connects all or part of the inputs directly to the output layer, as shown in Figure 10-14.
This architecture makes it possible for the neural network to learn both deep patterns (using the deep path) and simple rules (through the short path).
In contrast, a regular MLP forces all the data to flow through the full stack of layers, so simple patterns in the data may end up being distorted by this sequence of transformations.
Let's build such a neural network to tackle the California housing problem:
Note that we are just telling Keras how it should connect the layers together; no actual data is being processed yet.
・We then create a second hidden layer, and again we use it as a function.
Note that we pass it the output of the first hidden layer.
・Next, we create a Concatenate layer, and once again we immediately use it like a function, to concatenate the input and the output of the second hidden layer.
You may prefer the keras.layers.concatenate( ) function, which creates a Concatenate layer and immediately calls it with the given inputs.
・Then we create the output layer, with a single neuron and no activation function, and we call it like a function, passing it the result of the concatenation.
・Lastly, we create a Keras Model, specifying which inputs and outputs to use.
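The data flow described in the steps above can be traced with a tiny plain-Python forward pass. This is not the Keras code: the helper dense, the layer sizes, and all weights are made up for illustration. It only shows how the input is concatenated with the output of the second hidden layer before reaching the output neuron.

```python
def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: one output per (weight row, bias) pair."""
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    if activation == "relu":
        outputs = [max(0.0, o) for o in outputs]
    return outputs


x = [1.0, -2.0]                                                    # input features
hidden1 = dense(x, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], "relu")   # deep path, first layer
hidden2 = dense(hidden1, [[1.0, 1.0]], [0.5], "relu")              # deep path, second layer
concat = x + hidden2                                               # wide path joins the deep path
output = dense(concat, [[0.5, 0.5, 2.0]], [0.0])                   # single neuron, no activation
```

Note that the raw value -2.0 is zeroed out by the ReLU on the deep path but still reaches the output neuron through the wide path.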
Once you have built the Keras model, everything is exactly like earlier, so there's no need to repeat it here: you must compile the model, train it, evaluate it, and use it to make predictions.
But what if you want to send a subset of the features through the wide path and a different subset (possibly overlapping) through the deep path (see Figure 10-15)?
In this case, one solution is to use multiple inputs.
For example, suppose we want to send five features through the wide path (features 0 to 4), and six features through the deep path (features 2 to 7):
model = keras.Model(inputs=[input_A, input_B], outputs=[output])
The code is self-explanatory.
You should name at least the most important layers, especially when the model gets a bit complex like this.
Note that we specified inputs=[input_A, input_B] when creating the model.
Now we can compile the model as usual, but when we call the fit( ) method, instead of passing a single input matrix X_train, we must pass a pair of matrices (X_train_A, X_train_B): one per input.
The same is true for X_valid, and also for X_test and X_new when you call evaluate( ) or predict( ):
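The slicing behind the two inputs can be sketched as follows. The data here is a made-up list of three instances with eight features each, and plain Python lists stand in for the real feature matrices; only the column ranges (features 0 to 4 and features 2 to 7) come from the text.

```python
# Toy data: 3 instances with 8 features each (placeholder values).
X_train = [[float(10 * i + j) for j in range(8)] for i in range(3)]

X_train_A = [row[:5] for row in X_train]  # features 0 to 4, sent through the wide path
X_train_B = [row[2:] for row in X_train]  # features 2 to 7, sent through the deep path
```

Features 2 to 4 appear in both subsets, which is the overlap mentioned above. The same slicing would be applied to the validation set, the test set, and any new instances.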
Although most of the applications of Machine Learning today are based on supervised learning (and as a result, this is where most of the investments go), the vast majority of the available data is unlabeled:
we have the input features X, but we do not have the labels y.
The computer scientist Yann LeCun famously said that
"if intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake."
In other words, there is a huge potential in unsupervised learning that we have only barely started to sink our teeth into.
Say you want to create a system that will take a few pictures of each item on a manufacturing production line and detect which items are defective.
You can fairly easily create a system that will take pictures automatically, and this might give you thousands of pictures every day.
You can then build a reasonably large dataset in just a few weeks.
But wait, there are no labels!
If you want to train a regular binary classifier that will predict whether an item is defective or not, you will need to label every single picture as "defective" or "normal".
This will generally require human experts to sit down and manually go through all the pictures.
This is a long, costly, and tedious task, so it will usually only be done on a small subset of the available pictures.
As a result, the labeled dataset will be quite small, and the classifier's performance will be disappointing.
Moreover, every time the company makes any change to its products, the whole process will need to be started over from scratch.
Wouldn't it be great if the algorithm could just exploit the unlabeled data without needing humans to label every picture?
Enter unsupervised learning.
In Chapter 8 we looked at the most common unsupervised learning task:
dimensionality reduction.
In this chapter we will look at a few more unsupervised learning tasks and algorithms:
Clustering
The goal is to group similar instances together into clusters. Clustering is a great tool for data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, and more.
Anomaly detection
The objective is to learn what "normal" data looks like, and then use that to detect abnormal instances, such as defective items on a production line or a new trend in a time series.
Density estimation
This is the task of estimating the probability density function (PDF) of the random process that generated the dataset. Density estimation is commonly used for anomaly detection: instances located in very low-density regions are likely to be anomalies. It is also useful for data analysis and visualization.
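The link between density estimation and anomaly detection can be sketched with a one-dimensional Gaussian fitted to made-up measurements. The choice of a single Gaussian, the threshold at three standard deviations, and the names gaussian_pdf and is_anomaly are assumptions for illustration only.

```python
import math
import statistics


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


normal_data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]  # made-up "normal" measurements
mean = statistics.mean(normal_data)
std = statistics.stdev(normal_data)

# Density at three standard deviations from the mean, used as the cutoff.
threshold = gaussian_pdf(mean + 3 * std, mean, std)


def is_anomaly(x):
    """Flag instances that fall in a very low-density region."""
    return gaussian_pdf(x, mean, std) < threshold
```

A measurement close to the bulk of the data has a high estimated density, while one far away falls below the threshold and is flagged.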
Ready for some cake?
We will start with clustering, using K-Means and DBSCAN, and then we will discuss Gaussian mixture models and see how they can be used for density estimation, clustering, and anomaly detection.
Clustering
As you enjoy a hike in the mountains, you stumble upon a plant you have never seen before. You look around and you notice a few more.
They are not identified, yet they are sufficiently similar for you to know that they most likely belong to the same species (or at least the same genus).
You may need a botanist to tell you what species that is, but you certainly don't need an expert to identify groups of similar-looking objects.
This is called clustering: it is the task of identifying similar instances and assigning them to clusters, or groups of similar instances.
Just like in classification, each instance gets assigned to a group.
However, unlike classification, clustering is an unsupervised task.
Consider Figure 9-1: on the left is the iris dataset (introduced in Chapter 4), where each instance's species (i.e., its class) is represented with a different marker.
It is a labeled dataset, for which classification algorithms such as Logistic Regression, SVMs, or Random Forest classifiers are well suited.
On the right is the same dataset, but without the labels, so you cannot use a classification algorithm anymore.
This is where clustering algorithms step in:
many of them can easily detect the lower-left cluster.
It is also quite easy to see with our own eyes, but it is not so obvious that the upper-right cluster is composed of two distinct sub-clusters.
That said, the dataset has two additional features (sepal length and width), not represented here, and clustering algorithms can make good use of all features, so in fact they identify the three clusters fairly well (e.g., using a Gaussian mixture model, only 5 instances out of 150 are assigned to the wrong cluster).
Clustering is used in a wide variety of applications, including these:
For customer segmentation
You can cluster your customers based on their purchases and their activity on your website.
This is useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment.
For example, customer segmentation can be useful in recommender systems to suggest content that other users in the same cluster enjoyed.
For data analysis
When you analyze a new dataset, it can be helpful to run a clustering algorithm, and then analyze each cluster separately.
As a dimensionality reduction technique
Once a dataset has been clustered, it is usually possible to measure each instance's affinity with each cluster (affinity is any measure of how well an instance fits into a cluster).
Each instance's feature vector x can then be replaced with the vector of its cluster affinities.
If there are k clusters, then this vector is k-dimensional.
This vector is typically much lower-dimensional than the original feature vector, but it can preserve enough information for further processing.
For anomaly detection (also called outlier detection)
Any instance that has a low affinity to all the clusters is likely to be an anomaly.
For example, if you have clustered the users of your website based on their behavior, you can detect users with unusual behavior, such as an unusual number of requests per second.
Anomaly detection is particularly useful in detecting defects in manufacturing, or for fraud detection.
For semi-supervised learning
If you only have a few labels, you could perform clustering and propagate the labels to all the instances in the same cluster.
This technique can greatly increase the number of labels available for a subsequent supervised learning algorithm, and thus improve its performance.
For search engines
Some search engines let you search for images that are similar to a reference image.
To build such a system, you would first apply a clustering algorithm to all the images in your database; similar images would end up in the same cluster.
Then when a user provides a reference image, all you need to do is use the trained clustering model to find this image's cluster, and you can then simply return all the images from this cluster.
To segment an image
By clustering pixels according to their color, then replacing each pixel's color with the mean color of its cluster, it is possible to considerably reduce the number of different colors in the image.
Image segmentation is used in many object detection and tracking systems, as it makes it easier to detect the contour of each object.
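Two of the applications above, dimensionality reduction and anomaly detection, can be sketched with the same ingredient: a vector of per-cluster affinities. In this sketch the distance to each centroid plays the role of an inverse affinity (small distance means high affinity). The centroids, the threshold, and the function names are made up for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def centroid_distances(instance, centroids):
    """Replace a feature vector with its distance to each of the k centroids."""
    return [distance(instance, c) for c in centroids]


def is_far_from_all(instance, centroids, threshold):
    """Low affinity with every cluster: the closest centroid is still far away."""
    return min(centroid_distances(instance, centroids)) > threshold


# k = 2 hypothetical centroids in a 4-dimensional feature space.
centroids = [[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0]]

reduced = centroid_distances([0.0, 0.0, 0.0, 1.0], centroids)   # 4 features become 2
outlier = is_far_from_all([50.0, 50.0, 50.0, 50.0], centroids, threshold=5.0)
typical = is_far_from_all([0.0, 0.0, 0.0, 1.0], centroids, threshold=5.0)
```

With k clusters the new representation has k dimensions, whatever the number of original features, and an instance that is far from every centroid is a candidate anomaly.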
There is no universal definition of what a cluster is: it really depends on the context, and different algorithms will capture different kinds of clusters.
Some algorithms look for instances centered around a particular point, called a centroid.
Others look for continuous regions of densely packed instances:
these clusters can take on any shape.
Some algorithms are hierarchical, looking for clusters of clusters.
And the list goes on.
In this section, we will look at two popular clustering algorithms, K-Means and DBSCAN, and explore some of their applications, such as nonlinear dimensionality reduction, semi-supervised learning, and anomaly detection.
K-Means
Consider the unlabeled dataset represented in Figure 9-2: you can clearly see five blobs of instances.
The K-Means algorithm is a simple algorithm capable of clustering this kind of dataset very quickly and efficiently, often in just a few iterations.
It was proposed by Stuart Lloyd at Bell Labs in 1957 as a technique for pulse-code modulation, but it was only published outside of the company in 1982.
In 1965, Edward W. Forgy had published virtually the same algorithm, so K-Means is sometimes referred to as Lloyd-Forgy.
Let's train a K-Means clusterer on this dataset.
It will try to find each blob's center and assign each instance to the closest blob:
from sklearn.cluster import KMeans
k = 5
kmeans = KMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)
Note that you have to specify the number of clusters k that the algorithm must find.
In this example, it is pretty obvious from looking at the data that k should be set to 5, but in general it is not that easy.
We will discuss this shortly.
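To make "find each blob's center and assign each instance to the closest blob" more concrete, here is a bare-bones plain-Python sketch of the classic alternating iterations. It is not Scikit-Learn's implementation: the initial centroids are passed in explicitly (an assumption that sidesteps initialization), and the function names and toy points are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, centroids, max_iters=100):
    """Alternate between assigning points and updating centroids until stable."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iters):
        # Assignment step: index of the closest centroid for each point.
        labels = [min(range(len(centroids)),
                      key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:     # nothing moved: the clustering is stable
            break
        centroids = new_centroids
    return labels, centroids


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
labels, centroids = k_means(points, centroids=[[0.0, 0.0], [10.0, 10.0]])
```

On these two well-separated toy blobs the loop stabilizes after two iterations, with each centroid sitting at the mean of its three points.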
Each instance was assigned to one of the five clusters.
In the context of clustering, an instance's label is the index of the cluster that this instance gets assigned to by the algorithm: this is not to be confused with the class labels in classification (remember that clustering is an unsupervised learning task).
The KMeans instance preserves a copy of the labels of the instances it was trained on, available via the labels_ instance variable:
y_pred
array([4, 0, 1, ..., 2, 1, 0], dtype=int32)
y_pred is kmeans.labels_
True
We can also take a look at the five centroids that the algorithm found:
kmeans.cluster_centers_
array([[-2.80389616, 1.80117999], # index of cluster = 0 : [X1, X2]
[ 0.20876306, 2.25551336], # index of cluster = 1
[-2.79290307, 2.79641063], # index of cluster = 2
[-1.46679593, 2.28585348], # index of cluster = 3
[-2.80037642, 1.30082566]]) # index of cluster = 4
You can easily assign new instances to the cluster whose centroid is closest: