AI_ML_DL’s diary

A diary on artificial intelligence, machine learning, and deep learning

Machine Learning without DNN - 2

 

Studying machine learning without neural networks.

・Textbook:

A. Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, First Edition, O'Reilly, March 2017

・Papers:

Comparison between XGBoost, LightGBM and CatBoost Using a Home Credit Dataset, E. Al-Daoud, International Journal of Computer and Information Engineering, 13 (2019) 6-10

LightGBM: A Highly Efficient Gradient Boosting Decision Tree, G. Ke et al., NIPS 2017

XGBoost: A Scalable Tree Boosting System, T. Chen and C. Guestrin, arXiv, 10 Jun 2016

CatBoost: gradient boosting with categorical features support, A. V. Dorogush et al., arXiv, 24 Oct 2018

・Let's study Ensemble Learning and Random Forests, together with XGBoost, LightGBM, CatBoost, and so on.

 

*Today, let's study Boosting.

 

Boosting

Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner.

The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

There are many boosting methods available, but by far the most popular are AdaBoost (short for Adaptive Boosting) and Gradient Boosting.

・・・・・

skip AdaBoost

・・・・・

Gradient Boosting

Another very popular Boosting algorithm is Gradient Boosting.

Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor.

However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

Let's go through a simple regression example using Decision Trees as the base predictors (of course Gradient Boosting also works great with classification tasks).

This is called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT).
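The code below assumes a training set X, y (and later some new instances X_new) already exists. As a minimal sketch, here is one way to build the kind of noisy quadratic training set the book refers to; the shapes, the noise level, and the name X_new are my own choices, not the book's:

    import numpy as np

    np.random.seed(42)
    X = np.random.rand(100, 1) - 0.5                       # 100 samples, one feature
    y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)     # noisy quadratic target
    X_new = np.array([[0.8]])                              # a new instance to predict later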

First, let's fit a DecisionTreeRegressor to the training set (for example, a noisy quadratic training set):

    from sklearn.tree import DecisionTreeRegressor

    # first tree: fit directly to the target
    tree_reg1 = DecisionTreeRegressor(max_depth=2)
    tree_reg1.fit(X, y)

Now train a second DecisionTreeRegressor on the residual errors made by the first predictor:

    # second tree: fit to the residual errors of the first tree
    y2 = y - tree_reg1.predict(X)
    tree_reg2 = DecisionTreeRegressor(max_depth=2)
    tree_reg2.fit(X, y2)

Then we train a third regressor on the residual errors made by the second predictor:

    # third tree: fit to the residual errors of the second tree
    y3 = y2 - tree_reg2.predict(X)
    tree_reg3 = DecisionTreeRegressor(max_depth=2)
    tree_reg3.fit(X, y3)

Now we have an ensemble containing three trees.

It can make predictions on a new instance simply by adding up the predictions of all the trees:

    # the ensemble's prediction for new instances is the sum of the trees' predictions
    y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

 

A simpler way to train GBRT ensembles is to use Scikit-Learn's GradientBoostingRegressor class.

Much like the RandomForestRegressor class, it has hyperparameters to control the growth of Decision Trees (e.g., max_depth, min_samples_leaf, and so on), as well as hyperparameters to control the ensemble training, such as the number of trees (n_estimators).

The following code creates the same ensemble as the previous one:

    from sklearn.ensemble import GradientBoostingRegressor

    gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)

    gbrt.fit(X, y)
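As a quick sanity check (my addition, not from the book), we can compare this model's prediction on a new instance with the manual three-tree ensemble built above. The two are constructed in essentially the same way, so the values should be close, although GradientBoostingRegressor additionally starts from a constant initial prediction, so they need not be identical:

    y_pred_manual = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
    y_pred_gbrt = gbrt.predict(X_new)
    print(y_pred_manual, y_pred_gbrt)   # the two predictions should be close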

 

*This reminded me of something about Scikit-learn. I forget who it was, but someone on Twitter said something to the effect that the scikit-learn homepage has been left unchanged for ages, and that such an important tool should not be neglected like that.

・So I checked the homepage.

・Unless my memory is wrong, it has changed a great deal.

・While I was at it, I also updated Scikit-learn in Anaconda.

・It was an update from 0.21.2 to 0.22.1, but there were many related modules and packages, so it took a while.

・In addition, when I searched for LightGBM, which I have been curious about, I found the following article. It is still experimental, but it appears to be fast and to handle missing data. Note, however, that it does not go so far as to say that prediction accuracy improves.

Scikit-learn 0.21 introduces two new experimental implementations of gradient boosting trees, namely HistGradientBoostingClassifier and HistGradientBoostingRegressor, inspired by LightGBM (See [LightGBM]).

These histogram-based estimators can be orders of magnitude faster than GradientBoostingClassifier and GradientBoostingRegressor when the number of samples is larger than tens of thousands of samples.

They also have built-in support for missing values, which avoids the need for an imputer.

These fast estimators first bin the input samples X into integer-valued bins (typically 256 bins) which tremendously reduces the number of splitting points to consider, and allows the algorithm to leverage integer-based data structures (histograms) instead of relying on sorted continuous values when building the trees. The API of these estimators is slightly different, and some of the features from GradientBoostingClassifier and GradientBoostingRegressor are not yet supported: in particular sample weights, and some loss functions.

These estimators are still experimental: their predictions and their API might change without any deprecation cycle. To use them, you need to explicitly import enable_hist_gradient_boosting:
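The import that the documentation refers to looks like the following; this is a minimal sketch for scikit-learn 0.22, reusing the X and y assumed earlier, and showing that rows containing NaN can be passed in directly:

    import numpy as np

    # the experimental API must be enabled explicitly before the estimator can be imported
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    from sklearn.ensemble import HistGradientBoostingRegressor

    X_miss = X.copy()
    X_miss[::10] = np.nan                    # introduce some missing values
    hgb = HistGradientBoostingRegressor()    # number of trees is controlled by max_iter
    hgb.fit(X_miss, y)                       # NaNs are handled natively, no imputer needed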

 

*Let's learn the basics by following the textbook. Even if it's just copying it out!

The learning_rate hyperparameter scales the contribution of each tree.

If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better.

This is a regularization technique called shrinkage.

Figure 7-10 shows two GBRT ensembles trained with a low learning rate: the one on the left (lr=0.1, estimators=3) does not have enough trees to fit the training set, while the one on the right (lr=0.1, estimators=200) has too many trees and overfits the training set. 
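Figure 7-10 itself is not reproduced here, but the two ensembles it compares can be trained as follows (a sketch along the lines described above, reusing the X and y assumed earlier):

    # left of Figure 7-10: low learning rate but too few trees, so it underfits
    gbrt_slow_small = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=0.1)
    gbrt_slow_small.fit(X, y)

    # right of Figure 7-10: low learning rate with many trees, which ends up overfitting
    gbrt_slow_large = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1)
    gbrt_slow_large.fit(X, y)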

In order to find the optimal number of trees, you can use early stopping (see Chapter 4). 

A simple way to implement this is to use the staged_predict() method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.).

The following code trains a GBRT (Gradient Boosted Regression Trees) ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X_train, X_val, y_train, y_val = train_test_split(X, y)

    gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
    gbrt.fit(X_train, y_train)

    # validation error after each additional tree
    errors = [mean_squared_error(y_val, y_pred)
              for y_pred in gbrt.staged_predict(X_val)]
    bst_n_estimators = np.argmin(errors) + 1   # + 1 because staged_predict starts with one tree

    gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
    gbrt_best.fit(X_train, y_train)
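As a small follow-up (my addition), you can print what this procedure found:

    print("best number of trees:", bst_n_estimators)
    print("minimum validation MSE:", min(errors))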

*The method above keeps training until overfitting occurs, finds the optimal number of trees, and then trains again with that optimal number of trees. The next approach instead stops training as soon as the validation error stops improving and uses the model trained up to that point for prediction, i.e., so-called early stopping.

It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number).

You can do so by setting warm_start=True, which makes Scikit-Learn keep existing trees when the fit() method is called, allowing incremental training. The following code stops training when the validation error does not improve for five iterations in a row:

    gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

    min_val_error = float("inf")
    error_going_up = 0
    for n_estimators in range(1, 120):
        gbrt.n_estimators = n_estimators
        gbrt.fit(X_train, y_train)           # warm_start=True keeps the trees already trained
        y_pred = gbrt.predict(X_val)
        val_error = mean_squared_error(y_val, y_pred)
        if val_error < min_val_error:
            min_val_error = val_error
            error_going_up = 0
        else:
            error_going_up += 1
            if error_going_up == 5:
                break  # early stopping
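After the loop (my addition), the model keeps the trees trained so far; the count includes the five final iterations that no longer improved the validation error:

    print("trees trained before stopping:", gbrt.n_estimators)
    print("best validation MSE seen:", min_val_error)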

 

The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each tree.

For example, if subsample=0.25, then each tree is trained on 25% of the training instances, selected randomly.

As you can probably guess by now, this trades a higher bias for a lower variance.

It also speeds up training considerably.

This technique is called Stochastic Gradient Boosting.
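A minimal sketch of this option, reusing the training split from above (the value 0.25 is just the example from the text):

    # Stochastic Gradient Boosting: each tree is trained on a random 25% of the training set
    sgbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, subsample=0.25)
    sgbrt.fit(X_train, y_train)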

 

* On XGBoost, LightGBM, CatBoost, and the like.

I am reading the papers, but they are difficult.

Rather than overreach, I will wrap up this topic here for now.

 

*From tomorrow, let's take on the Kaggle competition, the Deepfake Detection Challenge.

 

f:id:AI_ML_DL:20200212071545p:plain

style=103 iteration=1

f:id:AI_ML_DL:20200212071643p:plain

style=103 iteration=20

f:id:AI_ML_DL:20200212071726p:plain

style=103 iteration=500