Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow 2nd Edition by A. Geron
Many Machine Learning problems involve thousands or even millions of features for each training instance.
Not only do all these features make training extremely slow, but they can also make it much harder to find a good solution, as we will see.
This problem is often referred to as the curse of dimensionality.
Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one.
For example, consider the MNIST images (introduced in Chapter 3):
the pixels on the image borders are almost always white, so you could completely drop these pixels from the training set without losing much information.
Figure 7-6 confirms that these pixels are utterly unimportant for the classification task.
Additionally, two neighboring pixels are often highly correlated:
if you merge them into a single pixel (e.g., by taking the mean of the two pixel intensities), you will not lose much information.
Looking back, MNIST handwritten digit recognition now looks like nothing more than a toy problem.
Reducing dimensionality does cause some information loss (just like compressing an image to JPEG can degrade its quality), so even though it will speed up training, it may make your system perform slightly worse.
It also makes your pipeline a bit more complex and thus harder to maintain.
So, if training is too slow, you should first try to train your system with the original data before considering using dimensionality reduction.
In some cases, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance, but in general it won't:
it will just speed up training.
Apart from speeding up training, dimensionality reduction is also extremely useful for data visualization (or DataViz).
Reducing the number of dimensions down to two (or three) makes it possible to plot a condensed view of a high-dimensional training set on a graph and often gain some important insights by visually detecting patterns, such as clusters.
Moreover, DataViz is essential to communicate your conclusions to people who are not data scientists - in particular, decision makers who will use your results.
In this chapter we will discuss the curse of dimensionality and get a sense of what goes on in high-dimensional space.
Then, we will consider the two main approaches to dimensionality reduction (projection and Manifold Learning), and we will go through three of the most popular dimensionality reduction techniques:
PCA, Kernel PCA, and LLE.
The Curse of Dimensionality
We are so used to living in three dimensions that our intuition fails us when we try to imagine a high-dimensional space.
Even a basic 4D hypercube is incredibly hard to picture in our mind (see Figure 8-1), let alone a 200-dimensional ellipsoid bent in a 1,000-dimensional space.
It turns out that many things behave very differently in high-dimensional space.
For example, if you pick a random point in a unit square (a 1 x 1 square), it will have only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be "extreme" along any dimension).
But in a 10,000-dimensional unit hypercube, this probability is greater than 99.999999%.
Most points in a high-dimensional hypercube are very close to the border.
*I don't understand this.
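My own quick check of this figure (not from the book): a point lies within 0.001 of a border if any of its d coordinates falls within 0.001 of 0 or 1, so the probability is 1 - (1 - 0.002)^d, which is about 0.4% for d = 2 and essentially 100% for d = 10,000.
def prob_near_border(d, eps=0.001):
    # Probability that a uniform random point in the unit d-cube lies
    # within eps of at least one face: 1 - (1 - 2*eps)**d
    return 1 - (1 - 2 * eps) ** d

print(prob_near_border(2))       # ~0.004, i.e. about 0.4% for the unit square
print(prob_near_border(10_000))  # > 99.999999% for the 10,000-dimensional hypercube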
Here is a more troublesome difference:
if you pick two points randomly in a unit square, the distance between these two points will be, on average, roughly 0.52.
If you pick two random points in a unit 3D cube, the average distance will be roughly 0.66.
But what about two points picked randomly in a 1,000,000-dimensional hypercube?
The average distance, believe it or not, will be about 408.25 (roughly √(1,000,000/6))!
This is counterintuitive:
how can two points be so far apart when they both lie within the same unit hypercube?
Well, there's just plenty of space in high dimensions.
As a result, high-dimensional datasets are at risk of being very sparse:
most training instances are likely to be far away from each other.
This also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations.
In short, the more dimensions the training set has, the greater the risk of overfitting it.
*I don't understand this.
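My own Monte Carlo check of the average-distance figures above (not from the book; assumes NumPy):
import numpy as np

rng = np.random.default_rng(42)

def mean_pairwise_distance(d, n_pairs=100_000):
    # Estimate the mean distance between two uniform random points in the unit d-cube
    a = rng.random((n_pairs, d))
    b = rng.random((n_pairs, d))
    return np.linalg.norm(a - b, axis=1).mean()

print(mean_pairwise_distance(2))   # ~0.52 in the unit square
print(mean_pairwise_distance(3))   # ~0.66 in the unit 3D cube
print(np.sqrt(1_000_000 / 6))      # ~408.25, the high-dimensional approximation sqrt(d/6)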
In theory, one solution to the curse of dimensionality could be to increase the size of the training set to reach a sufficient density of training instances.
Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions.
With just 100 features (significantly fewer than in the MNIST problem), you would need more instances than atoms in the observable universe in order for training instances to be within 0.1 of each other on average, assuming they were spread out uniformly across all dimensions.
*I don't understand this.
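My rough back-of-envelope for this claim (my own reasoning, not the book's derivation): if instances are spread uniformly and must sit within 0.1 of each other along every dimension, you need roughly 10 instances per dimension, i.e. on the order of 10^100 instances for 100 dimensions, whereas the observable universe is usually estimated to contain only about 10^80 atoms.
points_needed = 10 ** 100                   # ~10 points per axis, over 100 independent axes
atoms_in_universe = 10 ** 80                # commonly cited order of magnitude
print(points_needed // atoms_in_universe)   # ~1e20: far more instances needed than atoms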
(Figure caption: an n-cube can be projected inside a regular 2n-gonal polygon by a skew orthogonal projection, shown here from the line segment up to the 12-cube.)
Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow 2nd Edition by A. Geron
Suppose you pose a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert's answer. This is called the wisdom of the crowd.
Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor.
A group of predictors is called an ensemble; thus this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.
As an example of an Ensemble method, you can train a group of Decision Tree classifiers, each on a different random subset of the training set.
To make predictions, you obtain the predictions of all the individual trees, then predict the class that gets the most votes (see the last exercise in Chapter 6).
Such an ensemble of Decision Trees is called a Random Forest, and despite its simplicity, this is one of the most powerful Machine Learning algorithms available today.
As discussed in Chapter 2, you will often use Ensemble methods near the end of a project, once you have already built a few good predictors, to combine them into an even better predictor. In fact, the winning solutions in Machine Learning competitions often involve several Ensemble methods.
In this chapter we will discuss the most popular Ensemble methods, including bagging, boosting, and stacking. We will also explore Random Forests.
Voting Classifiers
Suppose you have trained a few classifiers, each one achieving about 80% accuracy.
You may have a Logistic Regression classifier, an SVM classifier, a Random Forest Classifier, a K-Nearest Neighbors classifier, and perhaps a few more.
A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes.
This majority-vote classifier is called a hard voting classifier.
Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble.
In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse.
*One reason ensembles give such good results probably lies in generalization.
Ensemble methods work best when the predictors are as independent from one another as possible.
One way to get diverse classifiers is to train them using very different algorithms.
This increases the chance that they will make very different types of errors, improving the ensemble's accuracy.
The following code creates and trains a voting classifier in Scikit-Learn, composed of three diverse classifiers (the training set is the moons dataset, introduced in Chapter 5):
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
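The snippet above only shows the imports; here is a sketch of how the rest of the example typically continues (my reconstruction, assuming X_train, y_train, X_test, y_test already hold the moons train/test split and using default hyperparameters):
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("rf", rnd_clf), ("svc", svm_clf)],
    voting="hard")

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)                       # X_train, y_train: moons training split
    y_pred = clf.predict(X_test)                    # X_test, y_test: held-out moons data
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))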
The voting classifier slightly outperforms all the individual classifiers.
If all classifiers are able to estimate class probabilities (i.e., they all have a predict_proba( ) method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers.
This is called soft voting.
It often achieves higher performance than hard voting because it gives more weight to highly confident votes.
All you need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities.
This is not the case for the SVC class by default, so you need to set its probability hyperparameter to True (this will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a predict_proba( ) method).
If you modify the preceding code to use soft voting, you will find that the voting classifier achieves over 91.2% accuracy!
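A sketch of that change, reusing the classifiers from the previous snippet (my illustration of the text above):
svm_clf = SVC(probability=True)   # enables predict_proba() via cross-validation (slower training)
voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("rf", rnd_clf), ("svc", svm_clf)],
    voting="soft")                # average class probabilities instead of counting votes
voting_clf.fit(X_train, y_train)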
Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow 2nd Edition by A. Geron
Like SVMs, Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks, and even multioutput tasks.
They are powerful algorithms, capable of fitting complex datasets.
For example, in Chapter 2 you trained a DecisionTreeRegressor model on the California housing dataset, fitting it perfectly (actually, overfitting it).
Decision Trees are also the fundamental components of Random Forests (see Chapter 7), which are among the most powerful Machine Learning algorithms available today.
In this chapter we will start by discussing how to train, visualize, and make predictions with Decision Trees.
Then we will go through the CART training algorithm used by Scikit-Learn, and we will discuss how to regularize trees and use them for regression tasks.
Finally, we will discuss some of the limitations of Decision Trees.
Training and Visualizing a Decision Tree
To understand Decision Trees, let's build one and take a look at how it makes predictions.
The following code trains a DecisionTreeClassifier on the iris dataset (see Chapter 4):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)
Figure 6-1 (the Iris Decision Tree) cannot be shown here, so it is omitted.
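Since the figure itself is omitted, here is a sketch of how such a tree diagram is typically produced with export_graphviz; the resulting .dot file can then be rendered to an image with the graphviz dot command:
from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=iris.feature_names[2:],   # petal length and width
    class_names=iris.target_names,
    rounded=True,
    filled=True,
)
# For example: dot -Tpng iris_tree.dot -o iris_tree.png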
Making Predictions
Let's see how the tree represented in Figure 6-1 makes predictions.
Suppose you find an iris flower and you want to classify it.
You start at the root node (depth 0, at the top):
this node asks whether the flower's petal length is smaller than 2.45 cm.
If it is, then you move down to the root's left child node (depth 1, left).
In this case, it is a leaf node (i.e., it does not have any child nodes), so it does not ask any questions:
simply look at the predicted class for that node, and the Decision Tree predicts that your flower is an Iris setosa (class=setosa).
Now suppose you find another flower, and this time the petal length is greater than 2.45 cm.
You must move down to the root's right child node (depth 1, right), which is not a leaf node, so the node asks another question:
is the petal width smaller than 1.75 cm?
If it is, then your flower is most likely an Iris versicolor (depth 2, left).
If not, it is likely an Iris virginica (depth 2, right).
It's really that simple.
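For example (my own illustration using the tree trained above; the petal measurements are made up):
tree_clf.predict_proba([[5, 1.5]])   # class probabilities at the leaf reached by this flower
tree_clf.predict([[5, 1.5]])         # predicted class index (here, most likely Iris versicolor)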
One of the many qualities of Decision Trees is that they require very little data preparation.
In fact, they don't require feature scaling or centering at all.
Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow 2nd Edition by A. Geron
A Support Vector Machine (SVM) is a powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection.
It is one of the most popular models in Machine Learning, and anyone interested in Machine Learning should have it in their toolbox.
SVMs are particularly well suited for classification of complex small- or medium-sized datasets.
This Chapter will explain the core concepts of SVMs, how to use them, and how they work.
The fundamental idea behind SVMs is best explained with some pictures.
Figure 5-1 shows part of the iris dataset that was introduced at the end of Chapter 4.
The two classes can clearly be separated easily with a straight line (they are linearly separable).
The left plot shows the decision boundaries of three possible linear classifiers.
The model whose decision boundary is represented by the dashed line is so bad that it does not even separate the classes properly.
The other two models work perfectly on this training set, but their decision boundaries come so close to the instances that these models will probably not perform as well on new instances.
In contrast, the solid line in the plot on the right represents the decision boundary of an SVM classifier;
this line not only separates the two classes but also stays as far away from the closest training instances as possible.
You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes.
This is called large margin classification.
Notice that adding more training instances "off the street" will not affect the decision boundary at all:
it is fully determined (or "supported") by the instances located on the edge of the street.
These instances are called the support vectors (they are circled in Figure 5-1).
SVMs are sensitive to the feature scales, as you can see in Figure 5-2:
in the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal.
After feature scaling (e.g., using Scikit-Learn's StandardScaler), the decision boundary in the right plot looks much better.
Soft Margin Classification
If we strictly impose that all instances must be off the street and on the right side, this is called hard margin classification.
There are two main issues with hard margin classification.
First, it only works if the data is linearly separable.
Second, it is sensitive to outliers.
Figure 5-3 shows the iris dataset with just one additional outlier:
on the left, it is impossible to find a hard margin;
on the right, the decision boundary ends up very different from the one we saw in Figure 5-1 without the outlier, and it will probably not generalize as well.
To avoid these issues, use a more flexible model.
The objective is to find a good balance between keeping the street as large as possible and limiting the margin violations (i.e., instances that end up in the middle of the street or even on the wrong side).
This is called soft margin classification.
When creating an SVM model using Scikit-Learn, we can specify a number of hyperparameters.
C is one of those hyperparameters.
If we set it to a low value, then we end up with the model on the left of Figure 5-4.
This model has a lot of margin violations, but it will probably generalize better.
If your SVM model is overfitting, you can try regularizing it by reducing C.
The following Scikit-Learn code loads the iris dataset, scales the features, and then trains a linear SVM model (using the LinearSVC class with C=1 and the hinge loss function, described shortly) to detect Iris virginica flowers.
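The code itself was not copied into these notes; a sketch of what it typically looks like (LinearSVC with C=1 and hinge loss inside a scaling pipeline):
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                   # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # 1 if Iris virginica, else 0

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)
svm_clf.predict([[5.5, 1.7]])   # classify a flower with these petal measurements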
Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class.
Instead of using the LinearSVC class, we could use the SVC class with a linear kernel.
When creating the SVC model, we would write SVC(kernel="linear", C=1).
Or we could use the SGDClassifier class, with SGDClassifier(loss="hinge", alpha=1/(m*C)).
This applies regular Stochastic Gradient Descent (see Chapter 4) to train a linear SVM classifier.
It does not converge as fast as the LinearSVC class, but it can be useful to handle online classification tasks or huge datasets that do not fit in memory (out-of-core training).
The LinearSVC class regularizes the bias term, so you should center the training set first by subtracting its mean.
This is automatic if you scale the data using the StandardScaler.
Also make sure you set the loss hyperparameter to "hinge", as it is not the default value.
Finally, for better performance, you should set the dual hyperparameter to False, unless there are more features than training instances (we will discuss duality later in the chapter).
The LinearRegression class is based on the scipy.linalg.lstsq( ) function (the name stands for "least squares"), which you could call directly:
theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
theta_best_svd
array([[4.21509616],
[2.77011339]])
You can use np.linalg.pinv( ) to compute the pseudoinverse directly.
np.linalg.pinv(X_b).dot(y)
array([[4.21509616],
[2.77011339]])
Finally, the book explains how the pseudoinverse is computed, using Singular Value Decomposition (SVD).
Unfortunately, I can't quite follow the SVD part.
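For my own reference: the SVD factors the training matrix X as U Σ V^T, and the pseudoinverse is X+ = V Σ+ U^T, where Σ+ is obtained by taking the reciprocal of every singular value above a small threshold and zeroing the rest. A small sketch (my own, assuming X_b and y from the earlier snippet) reproducing np.linalg.pinv this way:
U, s, Vt = np.linalg.svd(X_b, full_matrices=False)  # X_b = U @ diag(s) @ Vt
s_inv = np.where(s > 1e-10, 1 / s, 0.0)             # invert only the nonzero singular values
X_pinv = Vt.T @ np.diag(s_inv) @ U.T                # pseudoinverse of X_b
theta_best = X_pinv @ y                             # same result as np.linalg.pinv(X_b).dot(y)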
Computational Complexity
Continuing from the previous section, this part appears to explain the computational cost of each training approach.
Gradient Descent
Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak (fine-tune) parameters iteratively in order to minimize a cost function.
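A sketch of batch Gradient Descent for Linear Regression in the spirit of the book's example (my reconstruction; assumes X_b and y from the earlier Linear Regression snippet, with eta as the learning rate):
import numpy as np

eta = 0.1                       # learning rate
n_iterations = 1000
m = len(X_b)                    # number of training instances

theta = np.random.randn(2, 1)   # random initialization of the parameters

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)  # gradient of the MSE cost function
    theta = theta - eta * gradients                     # step in the direction of steepest descent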
What if your data is more complex than a straight line?
Surprisingly, you can use a linear model to fit nonlinear data.
A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features.
This technique is called Polynomial Regression.
Let's look at an example.
First, let's generate some nonlinear data, based on a simple quadratic equation (plus some noise; see Figure 4-12):
import numpy as np

m = 100
X = 6 * np.random.rand(m, 1) - 3                # 100 instances of a single feature in [-3, 3]
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)  # quadratic function plus Gaussian noise
Clearly, a straight line will never fit this data properly.
So let's use Scikit-Learn's PolynomialFeatures class to transform our training data, adding the square (second-degree polynomial) of each feature in the training set as a new feature (in this case there is just one feature):
from sklearn.preprocessing import PolynomialFeatures
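A sketch of how that example typically continues (my reconstruction: transforming the training data, then fitting a plain LinearRegression on the extended feature set):
from sklearn.linear_model import LinearRegression

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)    # each instance now has features [x1, x1^2]
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_          # roughly ([1.78], [[0.93, 0.56]])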
Not bad: the model estimates y-hat = 0.56*x1^2 + 0.93*x1 + 1.78 when in fact the original function was y = 0.5*x1^2 + 1.0*x1 +2.0 + Gaussian noise.
Note that when there are multiple features, Polynomial Regression is capable of finding relationships between features (which is something a plain Linear Regression model cannot do).
This is made possible by the fact that PolynomialFeatures also adds all combinations of features up to the given degree.
For example, if there were two features a and b, PolynomialFeatures with degree=3 would not only add the features a^2, a^3, b^2, and b^3 but also the combinations ab, a^2b, and ab^2.
PolynomialFeatures(degree=d) transforms an array containing n features into an array containing (n + d)! / (d! n!) features, where n! is the factorial of n, equal to 1 × 2 × 3 × ... × n.
Beware of the combinatorial explosion of the number of features!
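A quick check of that formula (my own example): with n = 10 original features and degree d = 3, (n + d)! / (d! n!) = 13! / (3! 10!) = 286 features (including the bias column).
from math import comb

n, d = 10, 3
print(comb(n + d, d))   # 286 = (n + d)! / (d! n!)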
import os
import tarfile
import urllib.request

def fetch_spam_data(spam_url=SPAM_URL, spam_path=SPAM_PATH):
    if not os.path.isdir(spam_path):
        os.makedirs(spam_path)
    # HAM_URL, SPAM_URL and SPAM_PATH are constants defined earlier in the notebook
    for filename, url in (("ham.tar.bz2", HAM_URL), ("spam.tar.bz2", spam_url)):
        path = os.path.join(spam_path, filename)
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        tar_bz2_file = tarfile.open(path)
        tar_bz2_file.extractall(path=SPAM_PATH)
        tar_bz2_file.close()
HAM_DIR = os.path.join(SPAM_PATH, "easy_ham")
SPAM_DIR = os.path.join(SPAM_PATH, "spam")
ham_filenames = [name for name in sorted(os.listdir(HAM_DIR)) if len(name) > 20]
spam_filenames = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 20]
We can use Python's email module to parse these emails (this handles headers, encoding, and so on):
import email
import email.policy
def load_email(is_spam, filename, spam_path=SPAM_PATH):
    directory = "spam" if is_spam else "easy_ham"
    with open(os.path.join(spam_path, directory, filename), "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]
Let's look at one example of ham and one example of spam, to get a feel of what the data looks like:
Some emails are actually multipart, with images and attachments (which can have their own attachments). Let's look at the various types of structures we have.
It seems that the ham emails are more often plain text, while spam has quite a lot of HTML. Moreover, quite a few ham emails are signed using PGP, while no spam is. In short, it seems that the email structure is useful information to have.
Okay, let's start writing the preprocessing functions. First, we will need a function to convert HTML to plain text. Arguably the best way to do this would be to use the great BeautifulSoup library, but I would like to avoid adding another dependency to this project, so let's hack a quick & dirty solution using regular expressions.
The following function first drops the <head> section, then converts all <a> tags to the word HYPERLINK, then it gets rid of all HTML tags, leaving only the plain text. For readability, it also replaces multiple newlines with single newlines, and finally it unescapes HTML entities (such as &gt; or &nbsp;):
import re
from html import unescape
def html_to_plain_text(html):
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)  # drop the <head> section
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)  # turn <a> tags into HYPERLINK
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)                       # strip all remaining HTML tags
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)                  # collapse multiple newlines
    return unescape(text)                                                      # unescape entities such as &gt;
OTC Newsletter Discover Tomorrow's Winners For Immediate Release Cal-Bay (Stock Symbol: CBYI) Watch for analyst "Strong Buy Recommendations" and several advisory newsletters picking CBYI. CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exchange. CBYI is trading around 25 cents and should skyrocket to $2.66 - $3.25 a share in the near future. Put CBYI on your watch list, acquire a position TODAY. REASONS TO INVEST IN CBYI A profitable company and is on track to beat ALL earnings estimates! One of the FASTEST growing distributors in environmental & safety equipment instruments. Excellent management team, several EXCLUSIVE contracts. IMPRESSIVE client list including the U.S. Air Force, Anheuser-Busch, Chevron Refining and Mitsubishi Heavy Industries, GE-Energy & Environmental Research. RAPIDLY GROWING INDUSTRY Industry revenues exceed $900 million, estimates indicate that there could be as much as $25 billi ...