A. Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, March 2017: First Edition, O'Reilly
・The papers are:
Comparison between XGBoost, LightGBM and CatBoost Using a Home Credit Dataset, E. Al-Daoud, International Journal of Computer and Information Engineering, 13, (2019) 6-10
LightGBM: A Highly Efficient Gradient Boosting Decision Tree, G. Ke et al., NIPS 2017
XGBoost: A Scalable Tree Boosting System, T. Chen and C. Guestrin, arXiv 10 Jun 2016
CatBoost: gradient boosting with categorical features support, A. V. Dorogush et al., arXiv 24 Oct 2018
・Let's study Ensemble Learning and Random Forests, along with XGBoost, LightGBM, CatBoost, and so on.
*Today, let's study Boosting.
Boosting
Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner.
The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.
There are many boosting methods available, but by far the most popular are AdaBoost (short for Adaptive Boosting) and Gradient Boosting.
・・・・・
skip AdaBoost
・・・・・
Gradient Boosting
Another very popular Boosting algorithm is Gradient Boosting.
Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor.
However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.
Let's go through a simple regression example using Decision Trees as the base predictors (of course Gradient Boosting also works great with regression tasks).
This is called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT).
First, let's fit a DecisionTreeRegressor to the training set (for example, a noisy quadratic training set):
from sklearn.tree import DecisionTreeRegressor
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)
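The snippet above assumes X and y already exist. A minimal sketch of such a noisy quadratic training set (the coefficients and noise level here are illustrative assumptions):

import numpy as np

np.random.seed(42)  # for reproducibility
X = np.random.rand(100, 1) - 0.5  # 100 points in [-0.5, 0.5)
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)  # noisy quadratic target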
Now train a second DecisionTreeRegressor on the residual errors made by the first predictor:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)
Then we train a third regressor on the residual errors made by the second predictor:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)
Now we have an ensemble containing three trees.
It can make predictions on a new instance simply by adding up the predictions of all the trees:
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
A simpler way to train GBRT (Gradient Boosted Regression Trees) ensembles is to use Scikit-Learn's GradientBoostingRegressor class.
Much like the RandomForestRegressor class, it has hyperparameters to control the growth of Decision Trees (e.g., max_depth, min_samples_leaf, and so on), as well as hyperparameters to control the ensemble training, such as the number of trees (n_estimators).
The following code creates the same ensemble as the previous one:
from sklearn.ensemble import GradientBoostingRegressor
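A sketch of the instantiation that should reproduce the three-tree ensemble trained manually above (max_depth=2 as before; learning_rate=1.0 keeps each tree's full contribution):

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)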
Scikit-Learn also provides histogram-based gradient boosting estimators, HistGradientBoostingClassifier and HistGradientBoostingRegressor; they have built-in support for missing values, which avoids the need for an imputer.
These fast estimators first bin the input samples X into integer-valued bins (typically 256 bins) which tremendously reduces the number of splitting points to consider, and allows the algorithm to leverage integer-based data structures (histograms) instead of relying on sorted continuous values when building the trees. The API of these estimators is slightly different, and some of the features from GradientBoostingClassifier and GradientBoostingRegressor are not yet supported: in particular sample weights, and some loss functions.
These estimators are still experimental: their predictions and their API might change without any deprecation cycle. To use them, you need to explicitly import enable_hist_gradient_boosting:
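A minimal sketch of that explicit import (required while these estimators were experimental):

from sklearn.experimental import enable_hist_gradient_boosting  # enables the classes below
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor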
*Let's follow the textbook and learn the basics. Even if it's just copying it out!
The learning_rate hyperparameter scales the contribution of each tree.
If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better.
This is a regularization technique called shrinkage.
Figure 7-10 shows two GBRT ensembles trained with a low learning rate: the one on the left (lr=0.1, estimators=3) does not have enough trees to fit the training set, while the one on the right (lr=0.1, estimators=200) has too many trees and overfits the training set.
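The two ensembles in that figure can be sketched as follows (hyperparameters as stated above; max_depth=2 and the data setup are assumptions carried over from the earlier example):

gbrt_slow_few = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=0.1)
gbrt_slow_few.fit(X, y)  # too few trees: underfits
gbrt_slow_many = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1)
gbrt_slow_many.fit(X, y)  # too many trees: overfits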
In order to find the optimal number of trees, you can use early stopping (see Chapter 4).
A simple way to implement this is to use the staged_predict() method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.).
The following code trains a GBRT (Gradient Boosted Regression Trees) ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:
import numpy as np
from sklearn.model_selection import train_test_split
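The rest of the snippet is not quoted above; a sketch of how it would go, assuming the noisy quadratic X and y from before and using mean squared error on a held-out validation set:

from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

# train a large ensemble, then measure the validation error at every stage
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1  # stages are 1-indexed

# retrain with the optimal number of trees
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)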
It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number).
You can do so by setting warm_start=True, which makes Scikit-Learn keep existing trees when the fit() method is called, allowing incremental training. The following code stops training when the validation error does not improve for five iterations in a row:
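A sketch of that incremental loop, reusing X_train, X_val and mean_squared_error from the previous sketch:

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)  # warm_start keeps the trees already built
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping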
The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each tree.
For example, if subsample=0.25, then each tree is trained on 25% of the training instances, selected randomly.
As you can probably guess by now, this trades a higher bias for a lower variance.
It also speeds up training considerably.
This technique is called Stochastic Gradient Boosting.
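For example, a minimal sketch (the other hyperparameter values are illustrative assumptions):

stochastic_gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=200,
                                            learning_rate=0.1, subsample=0.25)
stochastic_gbrt.fit(X_train, y_train)  # each tree sees a random 25% of the training set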
・The textbook is A. Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, March 2017: First Edition, O'Reilly.
・The second edition was published last year, but the copy at hand is the first edition.
・Let's study Chapter 7: Ensemble Learning and Random Forests, along with XGBoost, LightGBM, and so on.
1. The Machine Learning Landscape
What Is Machine Learning?
Machine Learning is the science (and art) of programming computers so they can learn from data.
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. ---Tom Mitchell, 1997
For example, your spam filter is a Machine Learning program that can learn to flag spam given examples of spam emails (e.g., flagged by users) and examples of regular (nonspam, also called "ham") emails. The examples that the system uses to learn are called the training set. Each training example is called a training instance (or sample). In this case, the task T is to flag spam for new emails, the experience E is the training data, and the performance measure P needs to be defined; for example, you can use the ratio of correctly classified emails. This particular performance measure is called accuracy, and it is often used in classification tasks.
If you just downloaded a copy of Wikipedia, your computer has a lot more data, but it is not suddenly better at any task. Thus it is not Machine Learning.
Why Use Machine Learning?
To summarize, Machine Learning is great for:
・Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
・Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
・Fluctuating environments: a Machine Learning system can adapt to new data.
・Getting insights about complex problems and large amounts of data.
Types of Machine Learning Systems
There are so many different types of Machine Learning systems that it is useful to classify them in broad categories based on:
・Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)
・Whether or not they can learn incrementally on the fly (online versus batch learning)
・Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)
Supervised/Unsupervised Learning
Supervised learning
In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels.
A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails.
Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression. To train the system, you need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).
Here are some of the most important supervised learning algorithms (covered in this book):
・k-Nearest Neighbors
・Linear Regression
・Logistic Regression
・Support Vector Machines (SVMs)
・Decision Trees and Random Forests
・Neural networks
Unsupervised learning
In unsupervised learning, as you might guess, the training data is unlabeled. The system tries to learn without a teacher.
Here are some of the most important unsupervised learning algorithms (we will cover dimensionality reduction in Chapter 8):
Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning.
The following code creates and trains a voting classifier in Scikit-Learn, composed of three classifiers (the training set is the moons dataset, introduced in Chapter 5):
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
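The rest of the snippet is not quoted here; a sketch of how the ensemble is typically assembled, assuming the moons data has already been split into X_train and y_train:

from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')  # hard voting: majority class wins
voting_clf.fit(X_train, y_train)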
There you have it! The voting classifier slightly outperforms all the individual classifiers.
If all classifiers are able to estimate class probabilities (i.e., they have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers.
This is called soft voting.
It often achieves higher performance than hard voting because it gives more weight to highly confident votes.
All you need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities.
This is not the case of the SVC class by default, so you need to set its probability hyperparameter to True (this will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a predict_proba() method).
If you modify the preceding code to use soft voting, you will find that the voting classifier achieves over 91% accuracy!
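A minimal sketch of the soft-voting variant, reusing the classifiers from the sketch above:

svm_clf = SVC(probability=True)  # adds predict_proba(), at the cost of slower training
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')  # average the predicted class probabilities
voting_clf.fit(X_train, y_train)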
Bagging and Pasting
One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set.
When sampling is performed with replacement, this method is called bagging.
When sampling is performed without replacement, it is called pasting.
Bagging and Pasting in Scikit-Learn
The following code trains an ensemble of 500 Decision Tree classifiers, each trained on 100 training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting instead, just set bootstrap=False).
The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (-1 tells Scikit-Learn to use all available cores):
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(
DecisionTreeClassifier(), n_estimators=500,
max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
The BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with Decision Tree classifiers.
As we have discussed, a Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set.
Instead of building a BaggingClassifier and passing in a DecisionTreeClassifier, you can instead use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees (similarly, there is a RandomForestRegressor class for regression tasks).
The following code trains a Random Forest classifier with 500 trees (each limited to maximum 16 nodes), using all available CPU cores.
from sklearn.ensemble import RandomForestClassifier
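A sketch of the training and prediction steps that sentence describes (X_train, y_train, and X_test are assumed to be the moons split used earlier):

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)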
With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.
The following BaggingClassifier is roughly equivalent to the previous RandomForestClassifier:
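A sketch of that roughly equivalent BaggingClassifier (splitter="random" adds per-split randomness so that the bagged trees behave more like the trees a Random Forest grows):

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)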
It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier. Generally, the only way to know is to try both and compare them using cross-validation (and tuning the hyperparameters using grid search).
Feature Importance
Yet another quality of Random Forests is that they make it easy to measure the relative importance of each feature.
Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest).
More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it (see Chapter 6).
Scikit-Learn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1.
You can access the result using the feature_importances_ variable.
For example, the following code trains a RandomForestClassifier on the iris dataset (introduced in Chapter 4) and outputs each feature's importance.
It seems that the most important features are the petal length (44%) and width (42%), while sepal length and width are rather unimportant in comparison (11% and 2%, respectively).
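For reference, a minimal sketch of the setup the snippet below assumes (loading the iris dataset and training the forest; n_estimators=500 is an illustrative choice):

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
>>> rnd_clf.fit(iris["data"], iris["target"])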
>>> for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
...     print(name, score)
...
sepal length (cm) 0.112492250999
sepal width (cm) 0.0231192882825
petal length (cm) 0.441030464364
petal width (cm) 0.423357996355
Similarly, if you train a Random Forest classifier on the MNIST dataset (introduced in Chapter 3) and plot each pixel's importance, you get the image represented in Figure 7-6.
Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.
*DTNN is one of the neural networks trained to compute, in place of DFT, the atomic and molecular forces that ab initio MD requires from DFT.
*H. Wang et al. developed DeePMD, an ab initio MD approach that incorporates DTNN-like functionality.
*Today, I would like to look at this DeePMD in a bit more detail.
・DeePMD-kit: A deep learning package for many-body potential energy representation and molecular dynamics, H. Wang, L. Zhang, J. Han and W. E, arXiv 31 Dec 2017
・So let's change the title to Machine Learning without DNN, spend one or two sessions on it, and then come back to Quantum-chemical insight.
To be continued
*Let's use F. Chollet's textbook Deep Learning with Python to study program code in this space (at the level of copying it out by hand).
・Let's make use of Neural Style transfer in Keras, which is used to process the images posted with each entry.
Let's start by defining the paths to the style-reference image and the target image. To make sure that the processed images are a similar size (widely different sizes make style transfer more difficult), you'll later resize them all to a shared height of 400 px.
Listing 8.14 Defining initial variables
from keras.preprocessing.image import load_img, img_to_array
target_image_path = 'img/portrait.jpg' # path to the image you want to transform
style_reference_image_path = 'img/transfer_style_reference.jpg' # path to the style image
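The listing then derives the generated image's dimensions from the target image, resizing to the shared height of 400 px mentioned above; a sketch of that step:

width, height = load_img(target_image_path).size  # dimensions of the target image
img_height = 400
img_width = int(width * img_height / height)  # keep the aspect ratio at the new height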
*Free energy of proton transfer at the water - TiO2 interface from ab initio deep potential molecular dynamics, M. F. Calegari Andrade, H-Y. Ko, L. Zhang, R. Car and A. Selloni, Chem. Sci., 2020
*MDAnalysis: A Python Package for the Rapid Analysis of Molecular Dynamics Simulations, R. G. Gowers et al., Proc. of the 15th Python in Science Conf. (SciPy 2016)
2. Gaussian feature expansion of the inter-atomic distance
・Furthermore, using c and d, the interatomic interaction term v is introduced.
3. Perform T interaction passes
・For now, on to the next step.
4. Predict energy contributions
・The final step.
・I don't understand the meaning of the interaction passes T, the third of the five processes.
*There is a doctoral thesis by K. T. Schütt, dated May 25, 2018.
*Thesis title: Learning Representations of Atomistic Systems with Deep Neural Networks
・Its contents appear to bring together four papers in total: the DTNN paper, two SchNet papers, and the 2014 paper How to represent crystal structures for machine learning: towards fast prediction of electronic properties.