AI_ML_DL’s diary

A diary on artificial intelligence, machine learning, and deep learning

Machine Learning without DNN - 1


 

*Today I will study machine learning without neural networks.

・The textbook is A. Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, March 2017: First Edition, O'Reilly.

・The second edition was published last year, but the copy I have is the first edition.

・Let's study Chapter 7: Ensemble Learning and Random Forests, along with XGBoost, LightGBM, and the like.

 

1. The Machine Learning Landscape

What Is Machine Learning?

Machine Learning is the science (and art) of programming computers so they can learn from data.

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. ---Tom Mitchell, 1997

For example, your spam filter is a Machine Learning program that can learn to flag spam given examples of spam emails (e.g., flagged by users) and examples of regular (nonspam, also called "ham") emails. The examples that the system uses to learn are called the training set. Each training example is called a training instance (or sample). In this case, the task T is to flag spam for new emails, the experience E is the training data, and the performance measure P needs to be defined; for example, you can use the ratio of correctly classified emails. This particular performance measure is called accuracy, and it is often used in classification tasks.
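
As a quick illustration (not from the book; the labels below are made up), accuracy is just the fraction of correct predictions:

    # Minimal sketch with made-up labels: accuracy is the ratio of
    # correctly classified examples (1 = spam, 0 = ham).
    from sklearn.metrics import accuracy_score

    y_true = [1, 0, 1, 1, 0]        # actual classes
    y_pred = [1, 0, 0, 1, 0]        # the filter's predictions
    print(accuracy_score(y_true, y_pred))   # 0.8 (4 of 5 correct)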

If you just downloaded a copy of Wikipedia, your computer has a lot more data, but it is not suddenly better at any task. Thus it is not Machine Learning.

Why Use Machine Learning?

To summarize, Machine Learning is great for:

・Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.

・Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.

・Fluctuating environments: a Machine Learning system can adapt to new data.

・Getting insights about complex problems and large amounts of data.

Types of Machine Learning Systems

There are so many different types of Machine Learning systems that it is useful to classify them in broad categories based on:

・Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)

・Whether or not they can learn incrementally on the fly (online versus batch learning)

・Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)

Supervised/Unsupervised Learning

Supervised learning

In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels.

A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails.

Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression. To train the system, you need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).
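
As a hedged sketch of a regression task (not from the book; the car data below is invented), a linear model can be fit on such predictor/label pairs:

    # Minimal sketch with made-up data: predict a car's price from its
    # mileage (km) and age (years) using linear regression.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[ 50000, 3],
                  [120000, 7],
                  [ 30000, 2],
                  [ 90000, 5]])                 # predictors: mileage, age
    y = np.array([15000, 6000, 18000, 9000])    # labels: prices (hypothetical)

    reg = LinearRegression().fit(X, y)
    print(reg.predict([[70000, 4]]))            # predicted price for an unseen car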

Here are some of the most important supervised learning algorithms (covered in this book):

・k-Nearest Neighbors

・Linear Regression

・Logistic Regression

・Support Vector Machines (SVMs)

・Decision Trees and Random Forests

・Neural networks

Unsupervised learning

In unsupervised learning, as you might guess, the training data is unlabeled. The system tries to learn without a teacher.

Here are some of the most important unsupervised learning algorithms (we will cover dimensionality reduction in Chapter 8); a small sketch follows the list:

・Clustering

    - k-Means

    - Hierarchical Cluster Analysis (HCA)

    - Expectation Maximization

・Visualization and dimensionality reduction

    - Principal Component Analysis (PCA)

    - Kernel PCA

    - Locally-Linear Embedding (LLE)

    - t-distributed Stochastic Neighbor Embedding (t-SNE)

・Association rule learning

    - Apriori

    - Eclat
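
As a rough sketch of the first two categories (not from the book; the iris data and parameters are chosen only for illustration), note that clustering and dimensionality reduction never look at the labels:

    # Minimal unsupervised sketch: cluster the iris measurements with k-Means
    # and project them to 2D with PCA; the class labels are never used.
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    X = load_iris()["data"]
    clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)   # cluster assignments
    X_2d = PCA(n_components=2).fit_transform(X)                 # 2D projection
    print(clusters[:10])
    print(X_2d[:3])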

Semisupervised learning

Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning.
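
The excerpt stops at the definition; as a hedged sketch of the idea in Scikit-Learn (not from the book), unlabeled instances can be marked with -1 and handled by a label-propagation style model:

    # Minimal semisupervised sketch: hide ~90% of the iris labels (-1 means
    # unlabeled) and let LabelSpreading infer them from the labeled few.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.semi_supervised import LabelSpreading

    X, y = load_iris(return_X_y=True)
    rng = np.random.RandomState(42)
    y_partial = np.copy(y)
    y_partial[rng.rand(len(y)) < 0.9] = -1        # mark most instances as unlabeled

    semi_clf = LabelSpreading().fit(X, y_partial)
    print((semi_clf.transduction_ == y).mean())   # agreement with the true labels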

Reinforcement Learning

Batch and Online Learning

Batch learning

Online learning

Instance-Based Versus Model-Based Learning

・・・・・・・・・・

 

CHAPTER 7

Ensemble Learning and Random Forests

Voting Classifiers

The following code creates and trains a voting classifier in Scikit-Learn, composed of three classifiers (the training set is the moons dataset, introduced in Chapter 5):

    from sklearn.ensemble import RandomForestClassifier

    from sklearn.ensemble import VotingClassifier

    from sklearn.linear_model import LogisticRegression

    from sklearn.svm import SVC

    log_clf = LogisticRegression()

    rnd_clf = RandomForestClassifier()

    svm_clf = SVC()

    voting_clf = VotingClassifier(

        estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting='hard')

    voting_clf.fit(X_train, y_train)

Let's look at each classifier's accuracy on the test set:

    >>> from sklearn.metrics import accuracy_score

    >>> for clf in (log_clf, rnd_clf, svm_clf, voting_clf):

    ...     clf.fit(X_train, y_train)

    ...     y_pred = clf.predict(X_test)

    ...     print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

    LogisticRegression 0.864

    RandomForestClassifier 0.872

    SVC 0.888

    VotingClassifier 0.896

There you have it! The voting classifier slightly outperforms all the individual classifiers.

If all classifiers are able to estimate class probabilities (i.e., they have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers.

This is called soft voting.

It often achieves higher performance than hard voting because it gives more weight to highly confident votes.

All you need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities.

This is not the case for the SVC class by default, so you need to set its probability hyperparameter to True (this will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a predict_proba() method).

If you modify the preceding code to use soft voting, you will find that the voting classifier achieves over 91% accuracy!
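
A minimal sketch of that change, reusing the classifiers from the hard-voting code above (the only differences are probability=True on the SVC and voting="soft"):

    # Soft voting: average the predicted class probabilities instead of
    # counting class votes. probability=True gives SVC a predict_proba() method.
    log_clf = LogisticRegression()
    rnd_clf = RandomForestClassifier()
    svm_clf = SVC(probability=True)

    voting_clf = VotingClassifier(
        estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
        voting='soft')
    voting_clf.fit(X_train, y_train)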

Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set.

When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating).

When sampling is performed without replacement, it is called pasting. 
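
A small illustration of the difference (not from the book; the indices are arbitrary): sampling the same index set with and without replacement.

    # With replacement (bagging-style): the same instance can be drawn several times.
    # Without replacement (pasting-style): each instance is drawn at most once.
    import numpy as np

    rng = np.random.RandomState(42)
    indices = np.arange(10)                              # pretend training-set indices
    print(rng.choice(indices, size=8, replace=True))     # bagging-style sample
    print(rng.choice(indices, size=8, replace=False))    # pasting-style sample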

Bagging and Pasting in Scikit-Learn

The following code trains an ensemble of 500 Decision Tree classifiers, each trained on 100 training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting instead, just set bootstrap=False).

The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (-1 tells Scikit-Learn to use all available cores):

    from sklearn.ensemble import BaggingClassifier

    from sklearn.tree import DecisionTreeClassifier

    bag_clf = BaggingClassifier(

            DecisionTreeClassifier(), n_estimators=500,

            max_samples=100, bootstrap=True, n_jobs=-1)

    bag_clf.fit(X_train, y_train)

    y_pred = bag_clf.predict(X_test)

The BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with Decision Tree classifiers.

Out-of-Bag Evaluation

Random Patches and Random Subspaces

Random Forests

As we have discussed, a Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set.

Instead of building a BaggingClassifier and passing in a DecisionTreeClassifier, you can instead use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees (similarly, there is a RandomForestRegressor class for regression tasks).

The following code trains a Random Forest classifier with 500 trees (each limited to a maximum of 16 nodes), using all available CPU cores:

    from sklearn.ensemble import RandomForestClassifier

    rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)

    rnd_clf.fit(X_train, y_train)

    y_pred_rf = rnd_clf.predict(X_test)

With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.

The following BaggingClassifier is roughly equivalent to the previous RandomForestClassifier:

    bag_clf = BaggingClassifier(

          DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),

          n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

Extra-Trees

It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier. Generally, the only way to know is to try both and compare them using cross-validation (and tuning the hyperparameters using grid search).
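
A hedged sketch of such a comparison (not from the book), assuming the X_train and y_train from the earlier moons example are still in scope:

    # Compare a Random Forest and Extra-Trees with 5-fold cross-validation;
    # neither is guaranteed to win, so measure both on your own data.
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
    ext_clf = ExtraTreesClassifier(n_estimators=500, n_jobs=-1)

    print(cross_val_score(rnd_clf, X_train, y_train, cv=5).mean())
    print(cross_val_score(ext_clf, X_train, y_train, cv=5).mean())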

Feature Importance

Yet another quality of Random Forests is that they make it easy to measure the relative importance of each feature. 

Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest).

More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it (see Chapter 6).

Scikit-Learn computes this score automatically for each feature after training, then scales the results so that the sum of all importances is equal to 1.

You can access the result using the feature_importances_ variable.

For example, the following code trains a RandomForestClassifier on the iris dataset (introduced in Chapter 4) and outputs each feature's importance.

It seems that the most important features are the petal length (44%) and width (42%), while sepal length and width are rather unimportant in comparison (11% and 2%, respectively).

    >>> from sklearn.datasets import load_iris

    >>> iris = load_iris()

    >>> rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)

    >>> rnd_clf.fit(iris["data"], iris["target"])

    >>> for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):

    ...     print(name, score)

    sepal length (cm) 0.112492250999

    sepal width (cm) 0.0231192882825

    petal length (cm) 0.441030464364

    petal width (cm) 0.423357996355

Similarly, if you train a Random Forest classifier on the MNIST dataset (introduced in Chapter 3) and plot each pixel's importance, you get the image represented in Figure 7-6.

Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.
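
As a hedged sketch of that use (not from the book), the importances can drive automatic feature selection, here with SelectFromModel on the iris data:

    # Keep only the features whose importance exceeds the threshold
    # (by default, the mean importance).
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    iris = load_iris()
    selector = SelectFromModel(RandomForestClassifier(n_estimators=500, n_jobs=-1))
    X_reduced = selector.fit_transform(iris["data"], iris["target"])
    print(X_reduced.shape)      # only the most important features remain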

 

*Tomorrow I will pick up where I left off and work through Boosting.

 

To be continued

 

[Images: style=102, iteration=1 / iteration=20 / iteration=500]