2021-08-01

What is persistent homology?

Persistent homology looks interesting, so let's look into it.

Persistent Homology — a Survey
Herbert Edelsbrunner and John Harer, Article · January 2008, DOI: 10.1090/conm/453/08802

ABSTRACT.

Persistent homology is an algebraic tool for measuring topological features of shapes and functions. It casts the multi-scale organization we frequently observe in nature into a mathematical formalism. Here we give a record of the short history of persistent homology and present its basic concepts. Besides the mathematics we focus on algorithms and mention the various connections to applications, including to biomolecules, biological networks, data analysis, and geometric modeling.
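
To get a concrete feel for "measuring topological features of functions", here is a minimal sketch of 0-dimensional persistence for a 1D sequence. It sweeps a threshold upward, tracks when connected components of the sublevel set are born and when they merge (the younger one dies), and returns the (birth, death) pairs. The function name `persistence_0d` and the sample values are my own illustration, not something from the survey.

```python
def persistence_0d(values):
    """(birth, death) pairs of sublevel-set components of a 1D sequence."""
    n = len(values)
    parent = [None] * n   # None means "sample not added yet"
    birth = {}            # birth value of each component root

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    # Add samples from lowest to highest value.
    for i in sorted(range(n), key=lambda k: values[k]):
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component born later dies at this value.
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if birth[young] < values[i]:   # skip zero-persistence pairs
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    # The component of the global minimum never dies.
    pairs.append((birth[find(0)], float("inf")))
    return sorted(pairs)


print(persistence_0d([0, 3, 1, 4, 2]))
```

Long-lived pairs correspond to prominent valleys of the sequence, and short-lived ones to small wiggles, which is the "multi-scale" aspect the abstract mentions.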

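This is very basic material, and I get the feeling that I cannot move forward without understanding it first, but I do not understand it at all.

Wanting to know about recent topics in persistent homology, I searched the literature from 2020 onward and found a paper that uses the ideas of persistent homology to improve the accuracy of deep-learning segmentation.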

A Topological Loss Function for Deep-Learning based Image Segmentation using Persistent Homology
James R. Clough, Nicholas Byrne, Ilkay Oksuz, Veronika A. Zimmer, Julia A. Schnabel, Andrew P. King
Abstract

We introduce a method for training neural networks to perform image or volume segmentation in which prior knowledge about the topology of the segmented object can be explicitly provided and then incorporated into the training process. By using the differentiable properties of persistent homology, a concept used in topological data analysis, we can specify the desired topology of segmented objects in terms of their Betti numbers and then drive the proposed segmentations to contain the specified topological features. Importantly this process does not require any ground-truth labels, just prior knowledge of the topology of the structure being segmented. We demonstrate our approach in four experiments: one on MNIST image denoising and digit recognition, one on left ventricular myocardium segmentation from magnetic resonance imaging data from the UK Biobank, one on the ACDC public challenge dataset and one on placenta segmentation from 3-D ultrasound. We find that embedding explicit prior knowledge in neural network segmentation tasks is most beneficial when the segmentation task is especially challenging and that it can be used in either a semi-supervised or post-processing context to extract a useful training gradient from images without pixelwise labels.
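
To see what "specifying the topology in terms of Betti numbers" means for a 2D mask, here is a small sketch that counts connected components (Betti number 0) and holes (Betti number 1) of a binary image by flood fill. It is not the differentiable persistent-homology loss of the paper, only a plain count on a hard mask. The helper names and the 8-connected foreground / 4-connected background convention are assumptions of this sketch.

```python
from collections import deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, value, neighbours):
    """Count connected regions of cells equal to `value` (breadth-first fill)."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def betti_numbers(mask):
    """Return (b0, b1) of a 2D binary mask given as a list of 0/1 rows."""
    cols = len(mask[0])
    # Pad with background so the outside forms a single background region.
    padded = ([[0] * (cols + 2)]
              + [[0] + list(row) + [0] for row in mask]
              + [[0] * (cols + 2)])
    b0 = count_components(padded, 1, N8)
    b1 = count_components(padded, 0, N4) - 1   # enclosed background regions
    return b0, b1


ring = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]
broken_ring = [[1, 1, 1],
               [1, 0, 0],
               [1, 1, 1]]
print(betti_numbers(ring), betti_numbers(broken_ring))
```

A ring-shaped structure such as the myocardium in a short-axis slice would be expected to have one component and one hole. A single missing pixel already changes the second number.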

Index Terms—Segmentation, Persistent Homology, Topology, Medical Imaging, Convolutional Neural Networks

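(By teaching neural networks mathematics, physics, chemistry, biology, pharmacology, medicine, and so on, that is, by giving these as prior knowledge, it should presumably become possible to create an artificial intelligence that surpasses humans. Science has developed in the order of mathematics, physics, chemistry, biology, and medicine. Before mathematics came logic (philosophy).)

Let's look at what seems to be the previous paper by James R. Clough et al.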

Explicit topological priors for deep-learning based image segmentation using persistent homology

James R. Clough, Ilkay Oksuz, Nicholas Byrne, Julia A. Schnabel and Andrew P. King
School of Biomedical Engineering & Imaging Sciences, King’s College London, UK

arXiv:1901.10244v1 [cs.CV] 29 Jan 2019

1 Introduction
Image segmentation, the task of assigning a class label to each pixel in an image, is a key problem in computer vision and medical image analysis. The most successful segmentation algorithms now use deep convolutional neural networks (CNN), with recent progress made in combining fine-grained local features with coarse-grained global features, such as in the popular U-net architecture [17]. Such methods allow information from a large spatial neighbourhood to be used in classifying each pixel. However, the loss function is usually one which considers each pixel individually rather than considering higher-level structures collectively.

Explicit topological priors are presumably what is introduced in order to take these higher-level structures into account collectively.

In many applications it is important to correctly capture the topological characteristics of the anatomy in a segmentation result. For example, detecting and counting distinct cells in electron microscopy images requires that neighbouring cells are correctly distinguished. Even very small pixelwise errors, such as incorrectly labelling one pixel in a thin boundary between cells, can cause two distinct cells to appear to merge. In this way significant topological errors can be caused by small pixelwise errors that have little effect on the loss function during training but may have large effects on downstream tasks. Another example is the modelling of blood flow in vessels, which requires accurate determination of vessel connectivity. In this case, small pixelwise errors can have a significant impact on the subsequent modelling task. Finally, when imaging subjects who may have congenital heart defects, the presence or absence of small holes in the walls between two chambers is diagnostically important and can be identified from images, but using current techniques it is difficult to incorporate this relevant information into a segmentation algorithm. For downstream tasks it is important that these holes are correctly segmented but they are frequently missed by current segmentation algorithms as they are insufficiently penalised during training. See Figure 1 for examples of topologically correct and incorrect segmentations of cardiac magnetic resonance images (MRI).
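
The point about one mislabelled pixel merging two cells can be checked with a toy example. The sketch below uses a one-pixel-wide cross-section (a 1D row) in which two cells are separated by a single boundary pixel. The data and function names are made up for illustration.

```python
def count_runs(row):
    """Number of maximal runs of 1s, i.e. connected components in 1D."""
    return sum(1 for i, v in enumerate(row)
               if v == 1 and (i == 0 or row[i - 1] == 0))


def pixel_accuracy(pred, truth):
    """Fraction of pixels labelled identically in pred and truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


truth = [1] * 10 + [0] + [1] * 10   # two cells, one boundary pixel between them
pred = [1] * 21                     # the boundary pixel is mislabelled

print(pixel_accuracy(pred, truth))            # 20 of 21 pixels are correct
print(count_runs(truth), count_runs(pred))    # cell count drops from 2 to 1
```

The pixelwise score stays high while the number of components is wrong, which is the kind of error a per-pixel loss barely penalises.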

f:id:AI_ML_DL:20210809102046p:plain

Let's look at one more paper.

Persistent-Homology-based Machine Learning and its Applications – A Survey
Chi Seng Pun et al., arXiv:1811.00252v1 [math.AT] 1 Nov 2018

Abstract
A suitable feature representation that can both preserve the data intrinsic information and reduce data complexity and dimensionality is key to the performance of machine learning models. Deeply rooted in algebraic topology, persistent homology (PH) provides a delicate balance between data simplification and intrinsic structure characterization, and has been applied to various areas successfully. However, the combination of PH and machine learning has been hindered greatly by three challenges, namely topological representation of data, PH-based distance measurements or metrics, and PH-based feature representation. With the development of topological data analysis, progresses have been made on all these three problems, but widely scattered in different literatures.
In this paper, we provide a systematical review of PH and PH-based supervised and unsupervised models from a computational perspective. Our emphasizes are the recent development of mathematical models and tools, including PH softwares and PH-based functions, feature representations, kernels, and similarity models. Essentially, this paper can work as a roadmap for the practical application of PH-based machine learning tools. Further, we consider different topological feature representations in different machine learning models, and investigate their impacts on the protein secondary structure classification.

Interesting, but I will stop here for now as of today. Written Tuesday, August 31.

f:id:AI_ML_DL:20210731234854p:plain — style=170 iteration=500

2021-07-17

Fuel cells and machine learning: late July to late August 2021

Over this month I will study applications of machine learning and deep learning to fuel cell development.

Fundamentals, materials, and machine learning of polymer electrolyte membrane fuel cell technology
Yun Wang et al., Energy and AI 1 (2020) 100014

f:id:AI_ML_DL:20210717214708p:plain

Machine learning and artificial intelligence (AI) have received increasing attention in material/energy development. This review also discusses their applications and potential in the development of fundamental knowledge and correlations, material selection and improvement, cell design and optimization, system control, power management, and monitoring of operation health for PEM fuel cells, along with main physics in PEM fuel cells for physics-informed machine learning.

4. Machine learning in PEMFC development

4.1. Machine learning overview

According to learning style, machine learning algorithms can be generally classified into three types: supervised learning, unsupervised learning, and reinforcement learning, as shown in Table 9.

f:id:AI_ML_DL:20210808203434p:plain

Table 10 lists popular supervised learning algorithms and their characteristics.

f:id:AI_ML_DL:20210808204126p:plain

Among many machine learning methods, the rapid development of deep learning in recent years has pushed it to the forefront of the field of AI.

Deep learning is the ANN with deep structures or multi-hidden layers [229-232].

It can achieve good performance with the support of big data and complex physics, and has a much simpler mathematical form than many traditional machine learning algorithms.

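(Take the 3D atomic coordinates and atomic numbers of the constituent atoms of a molecule or crystal, compute energies, electron distributions, energy bands, and so on by first-principles calculation, and train an ANN with the coordinates and atomic numbers as inputs and the first-principles results as training targets. Then, when new 3D coordinates and atomic numbers are fed to the trained ANN, it can output energies, electron distributions, and energy-band results with accuracy comparable to actually running the first-principles calculation.)

The relationship between AI, machine learning, and deep learning is shown in Fig. 2, along with the number of US patent applications per year [20].

We can expect that deep learning, such as physics-informed learning, will become the most important path to AI.

(ANNs used for image classification, natural language processing, and the like acquire their classification or translation abilities by learning from scratch from training data. In applications to the natural sciences, however, adding the fundamentals of physical chemistry to the ANN is making it possible for ANNs to produce results in that field with accuracy equal to or better than that of large-scale computation, and in a very short time.)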

However, deep learning relies on big data, and thus traditional machine learning still have strong applications, especially for interdisciplinary studies, and can solve problems with reasonable amounts of data.

Many open-source machine learning frameworks have been developed and made available to the general public, including Scikit-Learn, Caffe2, H2O, PyTorch (for neural networks), TensorFlow (for neural networks), and Keras (for neural networks).

4.2. Machine learning for performance prediction

PEMFC performance is characterized by the polarization curve, also called the I-V curve, which is determined by a number of factors including fuel cell dimensions, material properties, operation conditions, and electrochemical/physical processes [233-236].
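
To make the polarization curve concrete, here is a rough sketch using a commonly used simplified empirical form: an open-circuit voltage minus an activation (Tafel-type) loss, an ohmic loss, and a concentration loss near a limiting current. This form and every parameter value below are illustrative assumptions of mine, not taken from the review.

```python
import math


def cell_voltage(i, e_oc=1.2, a=0.03, i0=1e-3, r=0.15, b=0.05, i_lim=1.6):
    """Cell voltage [V] at current density i [A/cm^2] (illustrative parameters).

    e_oc  : open-circuit voltage [V]
    a, i0 : Tafel slope [V] and exchange current density [A/cm^2]
    r     : area-specific ohmic resistance [ohm cm^2]
    b     : concentration-loss coefficient [V]
    i_lim : limiting current density [A/cm^2]
    """
    if not 0 < i < i_lim:
        raise ValueError("current density must satisfy 0 < i < i_lim")
    activation = a * math.log(i / i0)           # kinetic loss
    ohmic = r * i                               # resistive loss
    concentration = -b * math.log(1 - i / i_lim)  # mass-transport loss
    return e_oc - activation - ohmic - concentration


for i in (0.1, 0.5, 1.0, 1.5):
    print(f"i = {i:.1f} A/cm^2  ->  V = {cell_voltage(i):.3f} V")
```

The voltage falls monotonically with current density, steeply at low current (activation), roughly linearly in the middle (ohmic), and steeply again close to the limiting current.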

Various physical models and experimental methods have been proposed to predict or directly measure the I-V curve, which are reviewed by many other works [158, 160, 202, 237].

As an alternative approach, machine learning is capable of establishing the relationship between inputs and output performance through proper training of existing data, as shown in Fig. 18.

f:id:AI_ML_DL:20210808213110p:plain

Mehrpooya et al. [233] experimentally constructed a database of PEMFC performance under various inlet humidity, temperature, and oxygen and hydrogen flow rates.

A two-hidden-layer ANN was then trained using the database to predict the performance under new conditions.

Total 460 points are contained in the database with 400 for training and 60 for testing, and R² of 0.982 (for the training) and 0.9723 (for the test) was achieved in their study.
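
For reference, the 400/60 split and the R² metric quoted here can be written down in a few lines. This is only a sketch of the bookkeeping: the index list stands in for the database and the random seed is arbitrary.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def split_indices(n_total, n_train, seed=0):
    """Shuffle 0..n_total-1 and split into train and test index lists."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(460, 400)
print(len(train_idx), len(test_idx))
print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # perfect prediction
print(r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))   # predicting the mean
```

An R² of 1 means perfect prediction and 0 means no better than always predicting the mean, so the reported 0.9723 on the 60 test points indicates a close fit.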

(At this level of content, I think the payoff is small relative to the effort involved.)

Unlike physical models, the mapping between inputs and outputs constructed by machine learning models does not follow an actual physical process; thus, the machine learning approach is also called the blackbox model.

Machine learning has unique advantages in PEMFC modeling, which requires no prior knowledge, especially of the complex coupled transport and electrochemical processes occurring in PEMFC operation.

This significantly reduces the level of modeling difficulty and also makes it possible to take into account any processes in which the physical mechanisms are not yet known or formulated.

The machine learning method is also advantageous in terms of computational efficiency in the implementation process after proper training.

This characteristic makes machine learning potentially extremely important in the practical PEMFC applications which usually involve a large size multiple-cell system, dynamic variation, and long-term operation.

For a complex physical model that takes multi-physics into account, the computational and time costs are usually too high; a simplified physical model lacks of high prediction accuracy.

For even a small scale stack of 5–10 cells, physics model-based 3D simulation usually requires 10–100 million gridpoints and takes days or weeks for predicting one case of steady-state operation [158, 160, 241].

In this regard, machine learning could greatly help to broaden the application of complex physical models by leveraging on prediction accuracy and computational efficiency.

Using the simulation data from complex physical models to train a machine learning model is a popular approach, usually referred to as surrogate modeling.

A surrogate model can replace the complex physical model with similar prediction accuracy but higher computational efficiency.
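
The surrogate idea can be sketched with a toy: run the "expensive" model a limited number of times, store the results, and answer later queries from the stored samples. Here the expensive model is just a stand-in function and the surrogate is plain linear interpolation. The studies cited in the review use SVMs or ANNs instead, so this is a deliberately simpler swapped-in technique, and all names are hypothetical.

```python
import bisect
import math


def expensive_model(x):
    """Stand-in for a costly physics simulation (hypothetical)."""
    return math.exp(-x) + 0.1 * x


class InterpolatingSurrogate:
    """Tabulate a model on a uniform grid and interpolate linearly."""

    def __init__(self, model, lo, hi, n_samples):
        self.xs = [lo + (hi - lo) * k / (n_samples - 1)
                   for k in range(n_samples)]
        self.ys = [model(x) for x in self.xs]   # the only expensive calls

    def __call__(self, x):
        if not self.xs[0] <= x <= self.xs[-1]:
            raise ValueError("query lies outside the sampled range")
        # Index of the right end of the grid interval containing x.
        k = min(max(bisect.bisect_right(self.xs, x), 1), len(self.xs) - 1)
        x0, x1 = self.xs[k - 1], self.xs[k]
        y0, y1 = self.ys[k - 1], self.ys[k]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


surrogate = InterpolatingSurrogate(expensive_model, 0.0, 3.0, 31)
queries = [0.03 * k for k in range(101)]
worst = max(abs(surrogate(x) - expensive_model(x)) for x in queries)
print(f"largest surrogate error on the test queries: {worst:.5f}")
```

As in the text, the surrogate is only trusted inside the range covered by the training samples, which is why the sketch refuses to extrapolate.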

Wang et al. [242] developed a 3D fuel cell model with a CL agglomerate sub-model to construct a database of the PEMFC performance with various CL compositions.

A data-driven surrogate model based on the SVM was then trained using the database, which exhibited comparable prediction capability to the original physical model with several-order higher computational efficiency.

It only took a second to predict an I-V curve using the surrogate model versus hundreds of processor-hours using the 3D physics-based model.

Owing to its computational efficiency, the surrogate model, coupled with a genetic algorithm (GA), is suitable for CL composition optimization.

Similarly, Khajeh-Hosseini-Dalasm et al. [243] combined a CL physical model and ANN to develop a surrogate model to predict the cathode CL performance and activation overpotential.

For fast prediction of the multi-physics state of PEM fuel cell, Wang et al. [244] developed a data-driven digital twinning framework, as shown in Fig. 20.

A database of temperature, gas reactant, and water content fields in a PEM fuel cell under various operating conditions was constructed using a 3D physical model.

Both ANN and SVM were used to solve the multi-physics data with spatial distribution characteristics.

The data-driven digital twinning framework mirrored the distribution characteristics of multi-physics fields, and ANN and SVM exhibited different prediction performances on different physics fields.

There is a great potential to improve the current two-phase models (e.g. the two-fluid and mixture approaches) of PEM fuel cells by using AI technology, for example, machine learning analysis of visualization data and VOF/LBM simulation results.

Physics-informed neural networks were recently proposed by Raissi et al. [174], known as hidden fluid mechanics (HFM), to encode the Navier-Stokes (NS) equation into deep learning for analyzing fluid flow images, as shown in Fig. 21.
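
The core idea of "encoding an equation into the learning" is that the loss has two parts: a misfit to the observed data and a penalty on the residual of the governing equation. The sketch below shows this on a deliberately tiny problem that I made up. The "physics" is the ODE du/dx = -K u, the model is a one-parameter function rather than a neural network, and derivatives are taken by finite differences rather than automatic differentiation as in HFM.

```python
import math

K = 2.0   # decay constant of the assumed governing equation du/dx = -K*u


def model(theta, x):
    """One-parameter candidate solution u(x) = exp(-theta * x)."""
    return math.exp(-theta * x)


def physics_residual(theta, x, h=1e-4):
    """Residual of du/dx + K*u = 0, using a central finite difference."""
    dudx = (model(theta, x + h) - model(theta, x - h)) / (2 * h)
    return dudx + K * model(theta, x)


def total_loss(theta, data, collocation, lam=1.0):
    """Mean squared data misfit plus lam times mean squared physics residual."""
    data_term = sum((model(theta, x) - u) ** 2 for x, u in data) / len(data)
    physics_term = (sum(physics_residual(theta, x) ** 2 for x in collocation)
                    / len(collocation))
    return data_term + lam * physics_term


data = [(0.5, 0.40), (1.0, 0.12)]             # two noisy observations
collocation = [0.1 * k for k in range(21)]    # points where physics is enforced
candidates = [0.5 + 0.05 * k for k in range(61)]
best = min(candidates, key=lambda t: total_loss(t, data, collocation))
print(f"best theta = {best:.2f}")
```

With only two noisy data points, the physics term acts as the regulariser and pulls the fitted parameter toward the value that is consistent with the equation.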

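(Raissi et al. recently proposed hidden fluid mechanics (HFM), a neural network based on physical information, in this case the Navier-Stokes equations. By building the Navier-Stokes equations into deep learning, it makes it possible to visualize the fluid flow.)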

Such a strategy can be extended to the deep learning of two-phase flow and fuel cell performance by incorporating relevant physics, such as the capillary pressure correlation, Darcy’s law, and the Butler-Volmer equation, into the neural networks.

Table 11 summarizes the main physics in each PEMFC component that deep learning can incorporate to effectively achieve the design targets.

f:id:AI_ML_DL:20210720231946p:plain

4.3. Machine learning for material selection

Machine learning is widely used in the chemistry and material communities to discover new material properties and develop next generation materials [245-247].

Experimental measurement, characterization and theoretical calculation are main traditional methods to diagnose or predict the properties of a material, which are usually expensive in terms of cost, time, and computational resources.

Material properties are influenced by many intricate factors, which increases the difficulty level in the search for optimal material synthesis using only traditional methods.

Machine learning can assist in material selection and property prediction using existing databases, which is advantageous in taking into account unknown physics and greatly increasing the efficiency.

As an example, in catalyst design, adsorbate binding energy prediction by the empirical Sabatier principle is widely used for the optimization of activity (Fig. 22(a)) [247].

To remove the empirical equation, a database of binding energy for different catalyst structures constructed by characterization or theoretical calculation is used to train a machine learning model, which shows a great efficiency in predicting the catalyst activity in a wide range to identify the optimal solution of the catalyst structure (Fig. 22(b)).

Owing to the great potentials of machine learning in chemistry and materials science, professional tools have been developed, along with universal machine learning frameworks, and numerous structure and property databases for molecules and solids can be easily accessed to model training.

Popular professional machine learning tools and databases are summarized in Table 12.

4.4. Machine learning for durability

A durable and stable PEM fuel cell that is reliable for the entire life of the system is crucial for its commercialization.

Thus, it is important to predict the state of health (SoH), the remaining useful life (RUL), and durability of PEM fuel cell using the data generated from monitoring units.

The cell voltage is the most important indicator of fuel cell performance and thus is a popular output parameter in the machine learning.

In recent years, machine learning has been employed to predict fuel cell durability and SoH, which can generally be classified as model-based and data-driven approaches.
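
As a very rough picture of a data-driven RUL estimate, the sketch below fits a straight line to monitored cell voltage over time and extrapolates to an end-of-life voltage. Real studies use far richer models. The linear trend, the threshold, and the sample readings here are all hypothetical.

```python
def fit_line(ts, vs):
    """Ordinary least-squares line v = slope * t + intercept."""
    n = len(ts)
    mt, mv = sum(ts) / n, sum(vs) / n
    slope = (sum((t - mt) * (v - mv) for t, v in zip(ts, vs))
             / sum((t - mt) ** 2 for t in ts))
    return slope, mv - slope * mt


def remaining_useful_life(ts, vs, v_end):
    """Time from the last sample until the fitted line reaches v_end."""
    slope, intercept = fit_line(ts, vs)
    if slope >= 0:
        return float("inf")           # no degradation trend detected
    t_end = (v_end - intercept) / slope
    return max(0.0, t_end - ts[-1])


hours = [0, 100, 200, 300]
volts = [0.70, 0.69, 0.68, 0.67]      # hypothetical monitored cell voltage [V]
print(remaining_useful_life(hours, volts, v_end=0.63))
```

Here the voltage drops by 0.01 V per 100 h, so the fitted line reaches 0.63 V at 700 h, about 400 h after the last reading.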

I searched Google Scholar for "fuel cell + materials informatics" and "fuel cell + deep learning".

For "+ materials informatics", only two papers from 2020 onward had "materials informatics" in the title. The authors of both are Japanese.

For "+ deep learning" there were more than ten, and if machine learning, reinforcement learning, and so on are included, around 30 turn up.

My intuition is that materials informatics took in machine learning and deep learning as one of its search tools, was strengthened as a result, and that machine learning, deep learning, and the like now drive materials informatics as its main component.

Sunday, July 18

f:id:AI_ML_DL:20210718104008p:plain

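In the figure above, (a) shows the trend in shipments of four types of fuel cell, and (b) shows the configuration/structure of a PEMFC (polymer electrolyte membrane fuel cell).

DMFC: Direct Methanol Fuel Cell, SOFC: Solid Oxide Fuel Cell, PAFC: Phosphoric Acid Fuel Cell, PEMFC: Polymer Electrolyte Membrane Fuel Cell

At the center of the cell is the polymer electrolyte membrane, with the anode catalyst on its left and the cathode catalyst on its right. Further to the left is the hydrogen diffusion layer, and further to the right is the air (oxygen source) diffusion layer. Coolant flows through the outermost layer on the cathode side. On the anode side, hydrogen gas is converted into hydrogen ions and electrons. On the cathode side, hydrogen ions, oxygen, and electrons are converted into water.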

1.2. Current status and technical barriers

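The challenges for PEMFCs are durability and cost.

The cost of the catalyst layer and the durability of the cell trade off against each other.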

1.3. Role of fundamentals, materials, and machine learning

Tuesday, July 20

4.2. Machine learning for performance prediction

Let's look at cited reference [174].

Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations
Maziar Raissi, Alireza Yazdani, and George Em Karniadakis, Science 367, 1026–1030 (2020)

For centuries, flow visualization has been the art of making fluid motion visible in physical and biological systems. Although such flow patterns can be, in principle, described by the Navier-Stokes equations, extracting the velocity and pressure fields directly from the images is challenging. We addressed this problem by developing hidden fluid mechanics (HFM), a physics-informed deep-learning framework capable of encoding the Navier-Stokes equations into the neural networks while being agnostic to the geometry or the initial and boundary conditions. We demonstrate HFM for several physical and biomedical problems by extracting quantitative information for which direct measurements may not be possible. HFM is robust to low resolution and substantial noise in the observation data, which is important for potential applications.

f:id:AI_ML_DL:20210720223708p:plain

We developed an alternative approach, which we call hidden fluid mechanics (HFM), that simultaneously exploits the information available in snapshots of flow visualizations and the NS equations, combined in the context of physics-informed deep learning (5) by using automatic differentiation. In mathematics, statistics, and computer science—in particular, in machine learning and inverse problems—regularization is the process of adding information in order to prevent overfitting or to solve an ill-posed problem. The prior knowledge of the NS equations introduces important structure that effectively regularizes the minimization procedure in the training of neural networks. For example, using several snapshots of concentration fields (inspired by the drawings of da Vinci in Fig. 1A), we obtained quantitatively the velocity and pressure fields (Fig. 1, B to D).

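Wednesday, July 28

Background knowledge:

Moore's Physical Chemistry (Japanese edition, ムーアの新物理化学), by W. J. Moore, translated by 藤代亮一:

Chapter 8: Chemical kinetics

37. Activated adsorption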

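The potential energy barrier that must be surmounted before adsorption occurs is often small or negligible, so the rate of adsorption is governed by the rate at which gas is supplied to the bare surface. In some cases, however, adsorption requires a considerable activation energy Ead, and the adsorption rate, A*exp(-Ead/RT), can then become small enough to determine the overall rate of the surface reaction. Adsorption that requires a considerable activation energy in this way is called activated adsorption.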
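
To get a sense of how strongly the factor A*exp(-Ead/RT) depends on the activation energy, here is a small numerical sketch. The pre-exponential factor, the temperature of 300 K, and the barrier of 80 kJ/mol are arbitrary illustrative values, not figures from the textbook.

```python
import math

R = 8.314462618   # gas constant [J/(mol K)]


def adsorption_rate(A, E_ad, T):
    """Arrhenius-type rate A * exp(-E_ad / (R * T)); E_ad in J/mol, T in K."""
    return A * math.exp(-E_ad / (R * T))


A = 1.0   # pre-exponential factor in arbitrary units
non_activated = adsorption_rate(A, 0.0, 300.0)
activated = adsorption_rate(A, 80e3, 300.0)
print(f"ratio at 300 K: {activated / non_activated:.2e}")
print(f"ratio at 673 K: {adsorption_rate(A, 80e3, 673.0) / A:.2e}")
```

With these numbers the barrier slows adsorption at room temperature by many orders of magnitude, and raising the temperature recovers a large part of the rate, which fits the picture of activated adsorption becoming the rate-determining step.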

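In general, chemisorption of gases on metal surfaces does not require much activation energy. J. K. Roberts showed that adsorption of hydrogen on carefully cleaned metal wire proceeds rapidly even at about 25 K, forming a monolayer of strongly adsorbed hydrogen atoms. The heat of adsorption in this case is close to the heat required to form the covalent bonds of the metal hydride.

... One important exception that behaves differently is the adsorption of nitrogen on an iron catalyst at 400 °C. This is a slow activated adsorption, and it appears to be the rate-determining step of the ammonia synthesis reaction that uses this catalyst. ...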

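38. Poisoning of catalysts

Catalysts are poisoned by very small amounts of foreign substances. Faraday emphasized that platinum used as a catalyst for the combination of H2 and O2 must be clean and free of grease, and that the reactant gases must not contain carbon monoxide. The highly effective catalytic action of platinum in oxidizing SO2 to SO3 was well known at the beginning of the 19th century, but it was not used industrially because the catalyst quickly lost its activity. Only when highly purified reactant gases became available, that is, gases freed of sulfur and arsenic compounds, could the reaction be kept going for long periods.

It is no coincidence that catalyst poisons such as CO, H2S, and arsenic compounds are also strong physiological poisons. They are toxic to animals because they poison the enzymes that promote biochemical reactions essential to life, and those reactions are thereby inhibited.

Catalyst poisons and reactants compete for the available catalyst surface. ... This raises an important question: does the degree of deactivation of the catalyst correspond quantitatively to the fraction of the surface occupied by the poison? This holds in some cases, but cases are also well known in which a small amount of poison causes inhibition far larger than the surface-area effect alone can explain.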

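39. The nature of the catalyst surface

Even a solid surface that is macroscopically smooth is rough on the scale of 10 Å. Examining a cleaved crystal face with the best optical techniques shows that it is a stepped surface. Experiments on photoelectric emission and thermionic emission from metals show that the surface is made up of regions with various different work functions. ... In addition, crystal edges and corners, grain boundaries, and other physical irregularities of the surface are thought to act as active centers with unusually high catalytic activity.

Thursday, July 29

Let's go back to the original paper.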

Fundamentals, materials, and machine learning of polymer electrolyte membrane fuel cell technology, Yun Wang et al., Energy and AI 1 (2020) 100014

f:id:AI_ML_DL:20210729084311p:plain

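Hydrogen Oxidation Reaction (HOR):

In the anode catalyst layer, hydrogen gas (hydrogen molecules) is oxidized (loses electrons), producing hydrogen ions (H+) and electrons.

The hydrogen ions (H+) generated by the HOR travel through the polymer electrolyte membrane (PEM) layer to the cathode catalyst layer.

Oxygen Reduction Reaction (ORR):

In the cathode catalyst layer, oxygen gas (oxygen molecules) is reduced (gains electrons, combines with hydrogen) by hydrogen ions (H+), producing water (H2O).

From left to right in the figure above, the hydrogen gas diffusion layer, the anode catalyst layer, the polymer electrolyte layer, the cathode catalyst layer, and the oxygen gas diffusion layer are stacked.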
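
Written as half-reactions, the two electrode processes described above and their sum are as follows. This is the standard textbook form for a hydrogen/oxygen cell with an acidic (proton-conducting) electrolyte, added here for reference.

```latex
\begin{align*}
\text{Anode (HOR):}   \quad & \mathrm{H_2} \;\rightarrow\; 2\,\mathrm{H^+} + 2\,e^- \\
\text{Cathode (ORR):} \quad & \tfrac{1}{2}\,\mathrm{O_2} + 2\,\mathrm{H^+} + 2\,e^- \;\rightarrow\; \mathrm{H_2O} \\
\text{Overall:}       \quad & \mathrm{H_2} + \tfrac{1}{2}\,\mathrm{O_2} \;\rightarrow\; \mathrm{H_2O}
\end{align*}
```

The protons cross the membrane while the electrons travel through the external circuit, which is why the membrane has to conduct protons but block electrons and gases.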

3. Fundamentals and materials

For the hydrogen oxidation reaction (HOR) and oxygen reduction reaction (ORR) to proceed efficiently, the materials used in fuel cells must be chosen so that a high beginning of life performance and durability are ensured.

For example, to improve the activation and reduce transport losses, various issues as discussed earlier need to be addressed, including durable electrocatalyst and its loading reduction [2], reactant/membrane contamination [91, 92], water management [93, 94], and degradation [95, 96].

Material advance and improvement are therefore important for fuel cell R&D, and fundamentals that establish the material properties and fuel cell performance under various operation conditions are highly needed.

3.1. Materials

3.1.1. Membrane

The PEM is located between the anode and cathode CLs.

Its main functions are two-fold:

(i) it acts as a separator between the anode and the cathode reactant gasses and electrons, and

(ii) it conducts protons from the anode to cathode CLs.

Therefore, as a separator it must be impermeable to gasses (i.e., it should not allow the crossover of hydrogen and oxygen) and must be electrically insulating.

In addition, the membrane material must withstand the harsh operating conditions of PEM fuel cells, and thus possess high chemical and mechanical stability [97].

f:id:AI_ML_DL:20210729095310p:plain

f:id:AI_ML_DL:20210804094935p:plain

This is an SEM image of Nafion XL.

Wednesday, August 4

3.1.2. Catalyst layers

Catalyst layers (CLs) are the component where the electrochemical reactions occur.

The CL material must provide continuous pathways for various reactant species; primarily,

(i) a path for proton transport,

(ii) a pore network for gaseous reactant supply and water removal, and

(That is, supply of hydrogen gas and oxygen gas, and removal of water.)

(iii) a passage for electron conduction between the CL and the current collector.

The CL material is a major factor affecting fuel cell performance and durability.

Conventional CLs are composed of electrocatalyst, carbon support, ionomer, and void space.

Optimization of the CL ink preparation has been the main driver in PEMFC development [21, 102].

This breakthrough highlights the importance of the so-called triple-phase boundaries of the ionomer, Pt/C, and void space so that all reactants could access for the reactions.

Conventional CLs are prepared based on the dispersion of a catalyst ink comprising a Pt/C catalyst, ionomer, and solvent.

Ink composition is important for aggregation of the ionomer and agglomeration of carbon particles, and the dispersion medium governs the ink’s properties, such as the aggregation dimension of the catalyst/ionomer particles, viscosity, and rate of solidification, and ultimately, the electrochemical and transport properties of the CLs [103-105].

The ionomer not only acts as a binder for the Pt/C particles but also proton conductor.

Imbalance in the ionomer loading increases the transport or ohmic loss, with a small amount of ionomer reducing the proton conductivity and a large amount increasing the transport resistance of gaseous reactants.

(Too little ionomer lowers the ionic conductivity, while too much ionomer degrades the transport of gaseous reactants because the porosity drops.)

Thursday, August 5

Understanding inks for porous-electrode formation
Kelsey B. Hatzell, Marm B. Dixit, Sarah A. Berlinger and Adam Z. Weber J. Mater. Chem. A, 5, 20527 (2017)

Scalable manufacturing of high-aspect-ratio multi-material electrodes are important for advanced energy storage and conversion systems. Such technologies often rely on solution-based processing methods where the active material is dispersed in a colloidal ink. To date, ink formulation has primarily focused on macro-scale process-specific optimization (i.e. viscosity and surface/interfacial tension), and been optimized mainly empirically. Thus, there is a further need to understand nano- and mesoscale interactions and how they can be engineered for controlled macroscale properties and structures related to performance, durability, and material utilization in electrochemical systems.

f:id:AI_ML_DL:20210805145736p:plain

f:id:AI_ML_DL:20210805150120p:plain

f:id:AI_ML_DL:20210805150226p:plain

f:id:AI_ML_DL:20210805150312p:plain

In summary, there is a growing need for fabricating porous electrodes with unprecedented control of layer composition. Key to this is knowledge of the underlying physics and phenomena going from multicomponent dispersions and inks to casting/processing to 3D structure. While there has been some recent work as highlighted herein, a great deal remains to be accomplished in order to inform predictive and not empirical optimizations. Such investigations have occurred in other fields such as semiconductors and coatings and dispersions in general, but this has not been translated to thin-film properties and functional layers as occur in electrochemical devices. Overall, ink engineering is an exciting opportunity to achieve next-generation composite materials, but requires systematic studies to elucidate design rules and metrics and identify controlling parameters and phenomena.

Let's go back to the original paper.

Fundamentals, materials, and machine learning of polymer electrolyte membrane fuel cell technology, Yun Wang et al., Energy and AI 1 (2020) 100014

In contrast, non-conventional CLs are structured such that one of the major ingredients in their conventional counterparts is eliminated [ 2 , 102 ].

Nanostructured thin film (NSTF) CLs from 3M are the most successful non-conventional CL.

They consist of whiskers where the catalyst is deposited without ionomer for proton conduction.

Over the years, they have proven to provide a higher activity than conventional CLs, as seen in Fig. 5 .

In addition, similar to conventional CLs, annealing can be used to change the CL structure and ultimately change its activity.

f:id:AI_ML_DL:20210805170340p:plain

Fig. 5. Schematic illustration and corresponding HRTEM images of the mesoscale ordering during annealing and formation of the mesostructured thin film starting from the as-deposited Pt–Ni on whiskers (A), annealed at 300 °C (B) and 400 °C (C). Specific activities of Pt–Ni NSTF as compared to those of polycrystalline Pt and Pt-NSTF at 0.9 V (D) [106] .
[106] van der Vliet DF , Wang C , Tripkovic D , et al. Mesostructured thin films as electrocatalysts with tunable composition and surface morphology. Nat Mater 2012;11:1051–8 .

August 10 (Tue): picking up the pace

Carbon is the most commonly used support material for catalyst because of its low cost, chemical stability, high surface area, and affinity for metallic nanoparticles.

The surface area of the support varies depending on its graphitization process and is reported to range from 10 to 2000 m²/g [107].

Ketjen Black and Vulcan XC-72 are popular carbons with a surface area of 890 m²/g and 228 m²/g, respectively [108].

Carbon tends to aggregate, forming carbon particle agglomerates with a bimodal pore size distribution (PSD).

This PSD is usually composed of the primary pores of typically 2–20 nm in size and secondary pores larger than 20 nm.

The primary pores are located between carbon particles in an agglomerate, while the secondary pores are between agglomerates.

Depending on the Pt distribution and utilization within an agglomerate, the primary pores play a key role in determining the electrochemical kinetics, while the secondary pores are important for reactant transport across a CL.

The portion of the primary and secondary pores is largely determined by the surface area of the carbon support [108] .

Hence, it has been reported that carbon supports also determine the optimal ionomer content and the Pt distribution in CLs [ 109 , 110 ].

Additionally, the anode overpotential is usually considered negligible in comparison with its cathode counterpart because of the sluggish ORR.

Thus, most work in the literature is focused on cathode CLs.

CL optimization is focused on not only enhanced durability but also reduction of the Pt loading.

For this purpose, it is crucial to determine the optimal combination of the carbon support and catalyst for loading reduction.

An example is highlighted in Fig. 6, where different carbons are heat-treated to induce the catalytic activities of PANI-derived catalysts and to ensure their performance and stability.

Rotating Ring-Disk Electrode (RDE) measurements were conducted to study the ORR activity of various heat-treated PANI-C catalysts as a function of temperature.

f:id:AI_ML_DL:20210809234006p:plain

The durability and stability of CL material are a major subject in R&D, which is related to multiple factors, mainly including (i) operating and environmental conditions, (ii) oxidant and fuel impurities, and (iii) contaminants and corrosion in cell components.

For instance, operation under high voltages (above 1.35 V), which may occur during fuel cell startup and shut-down, can lead to Pt dissolution [112] .

Operation further above this voltage will cause degradation of the carbon support, known as carbon corrosion.

In addition, any traces of a contaminant in the fuel or oxidant feeds can lead to a decrease in fuel cell performance by poisoning CL materials [ 113 , 114 ].

Some contaminants cover the Pt catalyst and then reduce the electrochemical surface area (ECSA) available for the reaction.

This catalytic contamination is usually reversible upon removal of the contaminants.

In certain instances, contaminants such as ammonia will cause irreversible degradation under adequate exposure time and concentration [44] .

Further, cell components, such as CLs and BPs, may contain contaminants, from their manufacturing process and/or material used, which eventually leach out and cause poisoning of the MEA.

This may include membrane poisoning by metallic cations [91] .

Up to date, Pt is the electrocatalyst of choice for the ORR in PEM fuel cells because of its high activity.

However, Pt has a high cost associated with it and is currently mined in mainly several countries, such as South Africa and Russia.

Furthermore, high Pt loading is required to reach the target lifetime without major efficiency loss.

Using state-of-the-art methods, Pt catalyst is distributed in a way that does not allow its full utilization in CLs [ 115 , 116 ].

Alternative catalysts that are either Pt free or Pt alloys are under research.

Two excellent review papers on the topic are provided by Ref. [ 117 , 118 ].

A summary of some of these catalysts, their current status, and remaining challenges is provided in Fig. 7 .

f:id:AI_ML_DL:20210809235032p:plain

Machine learning and AI are extremely helpful and highly demanding for CL development providing that CLs have been extensively studied for not only PEM fuel cells, but also many other systems, such as electrolyzers and sensors with Pt-catalyst electrodes.

The species transport equations, ORR reaction kinetics, two-phase flow, and degradation mechanisms can be encoded into the neural networks for effective physics-informed deep learning to understand the impacts of catalyst materials on fuel cell performance/durability and optimize the pore size, PSD, PTFE loading, ionomer content, and carbon and electrocatalyst loading.

In the mass production phase, machine learning and AI can assist the quality control of CL composition in signal processing and element analysis when integrated with detection techniques such as Laser Induced Breakdown Spectroscopy (LIBS) [119] .

Literature search: keyword: fuel cell deep learning

F.-K. Wang et al.: Hybrid Method for Remaining Useful Life Prediction of PEMFC Stack

ABSTRACT

Proton exchange membrane fuel cell (PEMFC) is a clean and efficient alternative technology for transport applications. The degradation analysis of the PEMFC stack plays a vital role in electric vehicles. We propose a hybrid method based on a deep neural network model, which uses the Monte Carlo dropout approach called MC-DNN and a sparse autoencoder model to analyze the power degradation trend of the PEMFC stack. The sparse autoencoder can map high-dimensional data space to low-dimensional latent space and significantly reduce noise data. Under static and dynamic operating conditions, using two experimental PEMFC stack datasets the predictive performance of our proposed model is compared with some published models. The results show that the MC-DNN model is better than other models. Regarding the remaining useful life (RUL) prediction, the proposed model can obtain more accurate results under different training lengths, and the relative error between 0.19% and 1.82%. In addition, the prediction interval of the predicted RUL is derived by using the MC dropout approach.

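The abstract says the prediction interval comes from the MC dropout approach. As a note on what that means in practice, here is a minimal sketch: dropout stays switched on at prediction time, the same input is passed through the network many times, and the spread of the outputs gives an empirical interval. The tiny one-hidden-layer network, its random untrained weights, the dropout rate, and the name `mc_dropout_predict` are all assumptions for illustration; this is not the MC-DNN model of the paper.

```python
import math
import random
import statistics


def mc_dropout_predict(x, weights_hidden, weights_out, drop_prob, n_samples, rng):
    """Run n_samples stochastic forward passes with dropout left on."""
    keep = 1.0 - drop_prob
    outputs = []
    for _ in range(n_samples):
        y = 0.0
        for w_h, w_o in zip(weights_hidden, weights_out):
            # Drop each hidden unit independently and rescale the survivors
            # (inverted dropout), so the expected output is unchanged.
            if rng.random() < keep:
                y += w_o * math.tanh(w_h * x) / keep
        outputs.append(y)
    return outputs


rng = random.Random(0)
# Hypothetical, untrained weights: only the sampling step is illustrated here.
weights_hidden = [rng.uniform(-1.0, 1.0) for _ in range(32)]
weights_out = [rng.uniform(-1.0, 1.0) for _ in range(32)]

samples = sorted(mc_dropout_predict(0.5, weights_hidden, weights_out, 0.2, 500, rng))
mean = statistics.fmean(samples)
lower = samples[int(0.025 * len(samples))]       # empirical 2.5th percentile
upper = samples[int(0.975 * len(samples)) - 1]   # empirical 97.5th percentile
print(f"mean={mean:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

In a real RUL setting the input would be a window of stack measurements and the weights would come from training; only the repeated stochastic forward pass carries over.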

It appears to use the data from the IEEE PHM 2014 Data Challenge.

Y. Xie et al.: Novel DBN and ELM Based Performance Degradation Prediction Method for PEMFC

ABSTRACT

Lifetime and reliability seriously affect the applications of proton exchange membrane fuel cell (PEMFC). Performance degradation prediction of PEMFC is the basis for improving the lifetime and reliability of PEMFC. To overcome the lower prediction accuracy caused by uncertainty and nonlinearity characteristics of degradation voltage data, this article proposes a novel deep belief network (DBN) and extreme learning machine (ELM) based performance degradation prediction method for PEMFC. A DBN-based fuel cell degradation features extraction model is designed to extract high-quality degradation features in the original degradation data by layer-wise learning. To tackle the issues of overfitting and instability in fuel cell performance degradation prediction, an ELM with good generalization performance is introduced as a nonlinear prediction model, which can get some enhancement of prediction precision and reliability. Based on the designed DBN-ELM model, the particle swarm optimization (PSO) algorithm is used in the model training process to optimize the basic network structure of DBN-ELM further to improve the prediction accuracy of the hybrid neural network. Finally, the proposed prediction method is experimentally validated by using actual data collected from the 5-cells PEMFC stack. The results demonstrate that the proposed approach always has better prediction performance compared with the existing conventional methods, whether in the cases of various training phase or the cases of multi-step-ahead prediction.


I. INTRODUCTION
The proton exchange membrane fuel cells (PEMFC) have been taken as a potential power generation system for many fields, including electric vehicles, aerospace electronics, and aircrafts [1], [2], due to its high conversion efficiency, low operation temperature, and clean reaction products [3], [4].

However, the fuel cell system is affected by multiple factors during operation, which reduces its reliability and shortens its lifetime [5].

Therefore, predicting the performance degradation can effectively indicate the health status of PEMFCs, which could provide a maintenance plan to reduce the failures and downtimes of PEMFCs, thereby extending their lifetime and increasing their reliability [6], [7].

The degradation prediction of PEMFCs can use the historical operating data, such as voltage, power, and impedance, to obtain early indications about fuel cell degradation trend and failure time [8].

The voltage drop is directly associated with failure modes and components aging of fuel cells, and it is also the easiest to obtain.

Thus, the voltage is commonly treated as the critical deterioration indicator reflecting the performance degradation of PEMFC [9], [10].

Current aging voltage prediction approaches can be grouped into two categories, model-based method, data-based method [11].

The model-based methods use the specific physical model or semi-empirical degradation model to provide the degradation estimation for the fuel cells.

However, their reliability is limited because the degradation mechanisms inside PEMFCs are still not fully understood [12].

Some other model-based methods use particle filter [13], Kalman filter [14], and their variants to estimate the health of PEMFC.

However, due to their limited nonlinear processing capabilities or low computational efficiency, they are difficult to describe the high nonlinearity and complexity of PEMFC aging processes.

From a practical point of view, the data-based methods are more advantageous because they can represent the degradation features observed in the aging voltage data flexibly without any prior knowledge about the fuel cells [15].

Moreover, the data-based methods are easy to deploy, less computationally complex, and more suitable for practical online applications [8].

The existing different data-based methods can be divided into data analytics methods and machine learning methods.

Regression analysis approaches, such as autoregressive integrated moving average methods [15], locally weighted projection regression methods [16], and regime switch vector autoregressive methods [17], are some of the data analytics methods that have been adopted.

A large number of machine learning methods also achieve the great strides in PEMFC degradation prediction, including the support vector machine (SVM) based methods [18], relevance vector machine (RVM) based methods [19], Gaussian process state space based methods [20], back propagation neural network based methods [21], Echo State Network based methods [22], adaptive neuro-fuzzy inference system (ANFIS) based methods [23], extreme learning machine (ELM) based methods [24], and so on.

However, the above data-based methods build the prediction model without considering the degradation characteristics of the voltage data.

Thus they may not achieve better performance.

The actual data contain more fluctuations and noises, which limit the effectiveness of the regression analysis approaches.

Besides, some voltage recovery phenomena contained in the voltage degradation process of fuel cell exhibit the high nonlinear characteristics which cannot be fully extracted by these shallow neural networks mentioned in [21]–[24].

The general machine learning methods noted in [18]– [20] not only have the weak feature extraction ability but also are affected by many artificial determining factors such as their kernel functions construction [25].

Therefore, to improve the unsatisfactory prediction performance, the designed prediction method should be tightly integrated with data characteristics.

Furthermore, considering the weak feature extraction ability of shallow models, it is better to employ the deep learning architecture for PEMFC degradation prediction.

To overcome the above problems, a novel PEMFC performance degradation prediction model based on the deep belief network (DBN) and extreme learning machine (ELM) is proposed for the first time, which considers the statistical characteristics of original degradation data.

Deep Belief Network, as a deep learning method [26], has achieved state-of-the-art results on challenging modelling and regression problems for highly nonlinear statistical data.

DBN can learn high-quality and robust features from the data through multiple layers of nonlinear feature transformation [27], which achieves high precision recognition on handwritten digits [28] and facial expression [29].

It can also accurately describe the complex mapping relationships between inputs and features and has achieved state-of-the-art results on lifetime prediction problems of Multi-bearing [30], lithium batteries [31] and rotating components [32].

Thus, the DBN method with good feature extraction and expression abilities is adopted in this article to learn the deep PEMFC degradation features from a large number of voltages that contain too much noise and redundant data.

However, the DBN model may encounter the problems of the overfitting and local minima when using the gradient-based learning algorithm to obtain network parameters.

The ELM method with good generalization and universal approximation capability [33] is introduced to solve these limitations.

In the proposed DBN-ELM model, ELM services as a supervised regressor on the top layer to obtain the solutions directly without such trivial issues [34].

Furthermore, the ELM regressor can employ the deep feature provided by DBN to obtain a relatively stable prediction performance, which can avoid the ill-posed problems [35] in common ELM caused by data statistical characteristics [36] and the initialization mode [37].

In short, the proposed DBN-ELM method employs the DBN to extract high-quality degradation features and generate a relatively stable feature space which is, in turn, fed into an ELM to perform PEMFC degradation voltage prediction.

The proposed novel prediction model combines the excellent feature learning ability of DBN and generalization performance of ELM, which aims to enhance PEMFC degradation prediction performance.

Furthermore, to further improve the prediction accuracy, the particle swarm optimization (PSO) algorithm as the optimization tool is adopted into the design of the DBN-ELM model.

The PSO algorithm with the advantages of fast search speed, simple structure, and good memory ability [23] is widely used to optimize the structure [38]–[40] and parameters [23], [41], [42] of neural networks (NN).

Thus, this article uses the PSO algorithm with time-varying inertia weight [43] to adjust the structural parameters of the DBN-ELM and improve prediction accuracy.
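To make the idea of PSO with a time-varying inertia weight concrete, here is a minimal sketch. It assumes the common variant in which the inertia weight decreases linearly over the iterations; the excerpt does not say which schedule reference [43] uses, so the schedule, the coefficients `c1` and `c2`, the function name `pso_minimize`, and the toy sphere objective are all assumptions. In the paper the objective would instead be the prediction error of a DBN-ELM with a given structure.

```python
import random


def pso_minimize(objective, dim, bounds, n_particles=20, n_iters=100,
                 w_start=0.9, w_end=0.4, c1=2.0, c2=2.0, seed=0):
    """Particle swarm optimization with a linearly decreasing inertia weight."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda k: pbest_val[k])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position seen by the swarm

    for t in range(n_iters):
        # Assumed schedule: inertia weight falls linearly from w_start to w_end.
        w = w_start - (w_start - w_end) * t / (n_iters - 1)
        for k in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[k][d] = (w * vel[k][d]
                             + c1 * r1 * (pbest[k][d] - pos[k][d])
                             + c2 * r2 * (gbest[d] - pos[k][d]))
                # Keep the particle inside the search bounds.
                pos[k][d] = min(hi, max(lo, pos[k][d] + vel[k][d]))
            val = objective(pos[k])
            if val < pbest_val[k]:
                pbest[k], pbest_val[k] = pos[k][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[k][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy objective standing in for a model's validation error."""
    return sum(v * v for v in x)


best, best_val = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
print(best, best_val)
```

A large inertia weight early on favors exploration, and a small one later favors refinement around the best positions found so far.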

Finally, the proposed DBN-ELM method is verified by different case studies on a 1kW PEMFC experimental platform.

The novelty and contributions of this article can be summarized as follows:

• The degradation characteristics of the experimental voltage data are firstly analyzed, which guides the tailored design of the high-performance prediction model.
• The DBN method is originally applied to the PEMFC performance degradation prediction for high-level degradation features extraction and learning.
• The novel DBN-ELM method can accurately infer future voltage degradation changes of the PEMFC stack.
• The PSO algorithm is introduced into the design of the proposed DBN-ELM prediction model to further improve the performance of PEMFC degradation prediction.
• Experimental results demonstrate the accuracy and generalization performance of the proposed method in PEMFC degradation prediction.

f:id:AI_ML_DL:20210810154836p:plain

f:id:AI_ML_DL:20210810154916p:plain

f:id:AI_ML_DL:20210810155553p:plain

f:id:AI_ML_DL:20210810155636p:plain

f:id:AI_ML_DL:20210810155706p:plain

f:id:AI_ML_DL:20210810155745p:plain

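This paper also uses the IEEE PHM 2014 Data Challenge data, so it is not much different from competing for scores in a Kaggle competition.

Even if a method scores well on a prepared dataset, it is unclear whether it would be usable in actual development work. How would it be used?

I was curious about TEM observation of catalyst layers, so I looked up the literature.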
Testing fuel cell catalysts under more realistic reaction conditions: accelerated stress tests in a gas diffusion electrode setup
Shima Alinejad et al., J. Phys.: Energy 2 (2020) 024003
Abstract

Gas diffusion electrode (GDE) setups have very recently received increasing attention as a fast and straightforward tool for testing the oxygen reduction reaction (ORR) activity of surface area proton exchange membrane fuel cell (PEMFC) catalysts under more realistic reaction conditions. In the work presented here, we demonstrate that our recently introduced GDE setup is suitable for benchmarking the stability of PEMFC catalysts as well. Based on the obtained results, it is argued that the GDE setup offers inherent advantages for accelerated degradation tests (ADT) over classical three-electrode setups using liquid electrolytes. Instead of the solid–liquid electrolyte interface in classical electrochemical cells, in the GDE setup a realistic three-phase boundary of (humidified) reactant gas, proton exchange polymer (e.g. Nafion) and the electrocatalyst is formed. Therefore, the GDE setup not only allows accurate potential control but also independent control over the reactant atmosphere, humidity and temperature. In addition, the identical location transmission electron microscopy (IL-TEM) technique can easily be adopted into the setup, enabling a combination of benchmarking with mechanistic studies.


2.2. Gas diffusion electrode cell setup.
An in-house developed GDE cell setup was employed in all electrochemical measurements that was initially designed for measurements in hot phosphoric acid [24]. The design used in the present study has been described before [31]. In short, it was optimized to low temperature PEMFC conditions (<100 °C) by placing a Nafion membrane between the catalyst layer and liquid electrolyte; no liquid electrolyte is in direct contact with the catalyst [31]. A photograph of the parts of the improved GDE setup is shown in figure 1.

f:id:AI_ML_DL:20210810210021p:plain

An advantage of half-cells with a liquid electrolyte - compared to MEA test - is the possibility of performing IL-TEM measurements to analyze the degradation mechanism leading to the loss in active surface area.

Here, we demonstrate that the same is feasible in the GDE setup, and even elevated temperatures can be used; see figure 5.

By placing the TEM grid between the membrane electrolyte and GDL, the IL-TEM method can be applied straightforwardly.

For the demonstration, a catalyst with lower Pt loading (20 wt%) was used to facilitate the ability to follow the change in individual particles.

The typical degradation phenomena, such as migration and coalescence (yellow circles) and particle detachment (red circle), can be clearly seen to occur as consequence of the load-cycle treatment.


f:id:AI_ML_DL:20210810204548p:plain

Let's also look into Raman analysis of ionomers.

Chemical States of Water Molecules Distributed Inside a Proton Exchange Membrane of a Running Fuel Cell Studied by Operando Coherent Anti-Stokes Raman Scattering Spectroscopy
Hiromichi Nishiyama, Shogo Takamuku, Katsuhiko Oshikawa, Sebastian Lacher, Akihiro Iiyama and Junji Inukai, J. Phys. Chem. C 2020, 124, 9703−9711

ABSTRACT:

On the performance and stability of proton exchange membrane fuel cells (PEMFCs), the water distribution inside the membrane has a direct influence.

In this study, coherent anti-Stokes Raman scattering (CARS) spectroscopy was applied to investigate the different chemical states of water (protonated, hydrogen-bonded (H-bonded) and non-H-bonded water) inside the membrane with high spatial (10 μm φ (area) × 1 μm (depth)) and time (1.0 s) resolutions.

The number of water molecules in different states per sulfonic acid group in a Nafion membrane was calculated using the intensity ratio of deconvoluted O−H and C−F stretching bands in CARS spectra as a function of current density and at different locations.

The number of protonated water species was unchanged regardless of the relative humidity (RH) and current density, whereas H-bonded water molecules increased with RH and current density.

This monitoring system is expected to be used for analyzing the transient states during the PEMFC operation.


f:id:AI_ML_DL:20210810213340p:plain

f:id:AI_ML_DL:20210810213436p:plain

f:id:AI_ML_DL:20210810213515p:plain

f:id:AI_ML_DL:20210810213607p:plain

I had not come across coherent anti-Stokes Raman scattering (CARS) spectroscopy before.

The broad peak from 3000 to 3500 cm-1, the O-H stretching vibration, is separated into five components. Let's look into this.

Peak 1 : 3059 cm-1 : Eigen cation H3O+

Peak 2 : 3289 cm-1 : H-bonded to SO3-

Peak 3 : 3371 cm-1 : Zundel cation H5O2+

Peak 4 : 3483 cm-1 : H-bonded to H2O

Peak 5 : 3559 cm-1 : non-H-bonded water
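To get a feel for what separating the O-H band into five components means, here is a minimal sketch that models the band as a sum of five Gaussians centered at the peak positions listed above and reports the area fraction of each component. The centers and assignments are the ones listed above; the amplitudes and widths are hypothetical placeholders, not values from the paper, and the Gaussian line shape itself is an assumption.

```python
import math

# Peak centers (cm^-1) and assignments as listed above.
PEAKS = [
    (3059.0, "Eigen cation H3O+"),
    (3289.0, "H-bonded to SO3-"),
    (3371.0, "Zundel cation H5O2+"),
    (3483.0, "H-bonded to H2O"),
    (3559.0, "non-H-bonded water"),
]
# Hypothetical amplitudes and widths (sigma, cm^-1); NOT taken from the paper.
AMPLITUDES = [0.2, 0.8, 0.5, 1.0, 0.3]
SIGMAS = [80.0, 60.0, 50.0, 60.0, 40.0]


def gaussian(nu, amplitude, center, sigma):
    """Single Gaussian component evaluated at wavenumber nu."""
    return amplitude * math.exp(-0.5 * ((nu - center) / sigma) ** 2)


def band(nu):
    """Modeled O-H stretching band: the sum of the five components."""
    return sum(gaussian(nu, a, c, s)
               for a, (c, _), s in zip(AMPLITUDES, PEAKS, SIGMAS))


# Analytic area of a Gaussian: amplitude * sigma * sqrt(2*pi).
areas = [a * s * math.sqrt(2.0 * math.pi) for a, s in zip(AMPLITUDES, SIGMAS)]
fractions = [area / sum(areas) for area in areas]

for (center, label), frac in zip(PEAKS, fractions):
    print(f"{center:.0f} cm^-1  {label:22s} {frac:.1%}")
```

In an actual deconvolution the amplitudes and widths would be fitted to the measured spectrum; here they are fixed by hand just to show how component areas turn into relative populations.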

August 11 (Wed)

There is a paper that examined hydrogen bonding in water.

Signatures of the hydrogen bonding in the infrared bands of water
J.-B. Brubach et al., The Journal of Chemical Physics 122, 184509 (2005)

f:id:AI_ML_DL:20210811094745p:plain

Following the above considerations on the OH bond oscillator strength as a function of the number of established H bonds, the three-Gaussian components were assigned to three dominating populations of water molecules.

The lowest frequency Gaussian (ω=3295 cm−1) is assigned to molecules having H-bond coordination number close to four, as this component sits close to the OH band observed in ice.

The corresponding population is labeled “network water.”

Conversely, the highest frequency Gaussian (ω=3590 cm−1) is ascribed to water molecules being poorly connected to their environment since the frequency position of this component lies close to that of multimer molecules (for instance, ωdimer=3640 cm−1).

This population is called “multimer water.”

In between the two extreme Gaussians lies a third component (ω=3460 cm−1) which we associate with water molecules having an average degree of connection larger than that of dimers or trimers but lower than those participating to the percolating networks.

This type of molecules is referred to as “intermediate water.”

Obviously, this picture describes a situation averaged over time and any one molecule is expected to belong to the three types of population over several picoseconds.

The fact that the intermediate water Gaussian sits very close to the quasi-isobestic point frequency means, according to our view, that the quasi-isobestic point separates water molecules with respect to their involvement or noninvolvement in the long range connective structures, built up by almost fully bonded water molecules.

As shown in the upper-right inset of Fig. 3, separating the band into three peaks is said to explain the temperature dependence of the spectra well. Below, each of these three components is listed next to the assignment with the closest wavenumber from the five-peak separation above. The three pairs seem to correspond very well.

lowest frequency Gaussian (ω=3295 cm−1) : close to the OH band observed in ice

Peak 2 : 3289 cm-1 : H-bonded to SO3-

third component (ω=3460 cm−1) : intermediate water

Peak 4 : 3483 cm-1 : H-bonded to H2O

highest frequency Gaussian (ω=3590 cm−1) : poorly connected to their environment

Peak 5 : 3559 cm-1 : non-H-bonded water
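To put numbers on how close the paired positions above are, here is a small sketch that computes the wavenumber offset for each pair. The pairing and the positions are the ones listed above; nothing else is assumed.

```python
# (infrared component in water, position) paired with (CARS peak in Nafion, position), in cm^-1.
pairs = [
    ("network water", 3295, "H-bonded to SO3-", 3289),
    ("intermediate water", 3460, "H-bonded to H2O", 3483),
    ("multimer water", 3590, "non-H-bonded water", 3559),
]

# Offset = CARS peak position minus infrared component position.
offsets = [cars - ir for _, ir, _, cars in pairs]

for (ir_label, ir, cars_label, cars), offset in zip(pairs, offsets):
    print(f"{ir_label} ({ir}) vs {cars_label} ({cars}): {offset:+d} cm^-1")
```

All three offsets are small compared with the width of the O-H band, which spans several hundred cm-1.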

I would like to read the following paper, but it is paywalled, so another time!

Mechanism of Ionization, Hydration, and Intermolecular H-Bonding in Proton Conducting Nanostructured Ionomers
Simona Dalla Bernardina, Jean-Blaise Brubach, Quentin Berrod, Armel Guillermo, Patrick Judeinstein, Pascale Roy and Sandrine Lyonnard

Abstract

Water–ions interactions and spatial confinement largely determine the properties of hydrogen-bonded nanomaterials. Hydrated acidic polymers possess outstanding proton-conducting properties due to the interconnected H-bond network that forms inside hydrophilic channels upon water loading.

We report here the first far-infrared (FIR) coupled to mid-infrared (MIR) kinetics study of the hydration mechanism in benchmark perfluorinated sulfonic acid (PFSA) membranes, e.g., Nafion.

The hydration process was followed in situ, starting from a well-prepared dry state, within unprecedented continuous control of the relative humidity.

A step-by-step mechanism involving two hydration thresholds, at respectively λ = 1 and λ = 3 water molecules per ionic group, is assessed.

The molecular environment of water molecules, protonic species, and polar groups are thoroughly described along the various states of the polymer membrane, i.e., dry (λ ≈ 0), fully ionized (λ = 1), interacting (λ = 1–3), and H-bonded (λ > 3).

This unique extended set of IR data provides a comprehensive picture of the complex chemical transformations upon loading water into proton-conducting membranes, giving insights into the state of confined water in charged nanochannels and its role in driving key functional properties as ionic conduction.

Let's look at a paper on evaluating platinum catalysts!

New approach for rapidly determining Pt accessibility of Pt/C fuel cell catalysts
Ye Peng et al., J. Mater. Chem. A, 9, 13471 (2021)

A rapid method for evaluating accessibility of Pt within Pt/C catalysts for proton exchange membrane fuel cells (PEMFCs) is provided. This method relies on 3-electrode techniques which are available to most materials scientists, and will accelerate development of next generation PEMFC catalysts with optimal distribution of Pt within the carbon support.

It is a short abstract, but I cannot grasp the purpose of the study from it.

Proton exchange membrane fuel cells (PEMFCs) are rapidly gaining entry into many commercial markets ranging from stationary power to heavy duty/light duty transportation.

However, as the technology continues to advance, operating current densities are pushed ever higher while platinum group metal (PGM) loadings are pushed ever lower.

To cut cost and improve performance, the catalyst loading has to be reduced and the current density raised.

As this occurs, new challenges are being discovered which require materials-level advances to overcome.

In particular, as PGM loadings are reduced to a level ≤0.125 mg cm-2, significant performance losses have been widely reported.

These losses are most clearly observed at current densities of >1.5 A cm-2, and have been correlated very strongly with a decrease in ‘roughness factor’ (‘r.f.’, a measure of cm2 Pt per cm2 membrane electrode assembly (MEA)) at the cathode, leading several researchers to attribute this to an oxygen transport phenomenon occurring at each individual Pt site.

I don't understand what ‘roughness factor’ means either.
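Since the quoted text defines the roughness factor as cm2 of Pt per cm2 of MEA, one common way to estimate it is to multiply the Pt loading by the electrochemically active surface area per unit mass of Pt. The sketch below does just that unit conversion. The loading of 0.125 mg cm-2 is the threshold quoted above; the ECSA of 50 m2/g and the function name `roughness_factor` are assumptions for illustration.

```python
def roughness_factor(pt_loading_mg_per_cm2, ecsa_m2_per_g):
    """Return cm^2 of Pt surface per cm^2 of MEA (dimensionless)."""
    g_pt_per_cm2 = pt_loading_mg_per_cm2 * 1e-3   # mg -> g
    cm2_pt_per_g = ecsa_m2_per_g * 1e4            # m^2 -> cm^2
    return g_pt_per_cm2 * cm2_pt_per_g


# 0.125 mg/cm^2 is the loading quoted above; 50 m^2/g is a hypothetical ECSA.
rf = roughness_factor(0.125, 50.0)
print(f"roughness factor = {rf:.1f} cm2_Pt per cm2_MEA")
```

With these numbers the result is about 62.5, meaning each cm2 of electrode carries roughly 62.5 cm2 of Pt surface. Lowering the loading lowers this number in proportion, so the same current has to be carried by less Pt surface.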

f:id:AI_ML_DL:20210811162956p:plain

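This compares Vulcan carbon, which has a low surface area, with Ketjen black, which has a high surface area; in both cases the surface area decreases when platinum is added. This is presumed to be because the platinum nanoparticles block the nanoscale spaces in the carbon (graphite).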

f:id:AI_ML_DL:20210811163520p:plain

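Performance differs between Vulcan carbon and Ketjen black. The left panel shows that the performance ranking reverses depending on the current density. The right panel shows that the humidity dependence is small for Vulcan carbon but large for Ketjen black; this difference is presumed to arise because both platinum and ionomer penetrate into the pores of the carbon material. MEA-level experiments allow Pt/VC and Pt/KB to be compared, but fabricating and testing MEAs is not easy in an ordinary laboratory. MEA: membrane electrode assembly (gas (H2) diffusion layer / anode catalyst layer / PEM (polymer electrolyte membrane) / cathode catalyst layer / gas (O2) diffusion layer)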

f:id:AI_ML_DL:20210811164153p:plain

3D-TEM makes it possible to distinguish whether the platinum particles sit on the outside of the carbon particles or have penetrated inside.

f:id:AI_ML_DL:20210811165057p:plain

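This figure shows the main result of the paper. When hydrogen underpotential deposition (HUPD) is plotted against sweep rate, the slope of the resulting line serves as an indicator of "Pt accessibility", and Pt/VC, which has the smaller slope, can be judged to be more Pt-accessible than Pt/KB. The contribution of the paper appears to be that Pt/C can be evaluated at lower cost and in less time than by running time-consuming, expensive 3D-TEM or by preparing gram quantities of platinum catalyst and testing an assembled cell (MEA).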
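The slope comparison described above amounts to a straight-line fit of the HUPD signal against sweep rate. Here is a minimal sketch with an ordinary least-squares fit. The data points are entirely hypothetical (made-up normalized HUPD values at made-up sweep rates), chosen only so that the "Pt/VC" series has the flatter line; they are not read from the figure, and the names `linear_fit`, `hupd_vc`, and `hupd_kb` are illustrative.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical sweep rates (mV/s) and normalized HUPD values; NOT data from the paper.
sweep_rates = [10.0, 20.0, 50.0, 100.0, 200.0]
hupd_vc = [1.000, 0.995, 0.980, 0.955, 0.905]   # stand-in for Pt/VC
hupd_kb = [1.000, 0.985, 0.940, 0.865, 0.715]   # stand-in for Pt/KB

slope_vc, _ = linear_fit(sweep_rates, hupd_vc)
slope_kb, _ = linear_fit(sweep_rates, hupd_kb)

# The series with the smaller slope magnitude is read as the more accessible one.
more_accessible = "Pt/VC" if abs(slope_vc) < abs(slope_kb) else "Pt/KB"
print(f"slope Pt/VC = {slope_vc:.4f}, slope Pt/KB = {slope_kb:.4f} -> {more_accessible}")
```

Only a handful of voltammograms at different sweep rates are needed for such a fit, which is why the method is so much cheaper than 3D-TEM or MEA testing.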

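August 12 (Thu)

The goal is to infer catalyst-layer performance and its degradation process from degradation test data on the catalyst (layer) and the analysis of that data, but I have no hands-on experience with electrochemical testing, so I lose track of the discussion partway through. So today I will study the basics of electrochemical measurement.

Study material: Polarization Curves and Cyclic Voltammetry (2): Fuel Cells (PEFC)
Tsutomu Ioroi, Kazuaki Yasuda, Electrochemistry, 77, No. 3, 263-268 (2009)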

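1 Introduction
In research on polymer electrolyte fuel cells (PEFCs), polarization measurements and cyclic voltammetry (CV) are analysis techniques in everyday use.

In PEFC research, however, they are rarely used to determine diffusion coefficients or exchange current densities as covered in the previous overview article, and are more often applied for practical purposes.

For example, the activation-controlled oxygen reduction current at the cathode is obtained from a polarization measurement and divided by the active surface area obtained from a CV measurement to give the specific activity (current per unit surface area of catalyst), and the activities of different catalyst materials are then compared on that basis.

In addition, degradation-factor analysis aimed at improving PEFC durability has been actively pursued in recent years, and polarization measurement and CV techniques are essential both for accelerating catalyst degradation and for evaluating it quantitatively.

This article introduces practical analysis methods for polarization and CV measurements, using single cells with a power-generating membrane electrode assembly (MEA) and half-cells with aqueous electrolyte such as rotating electrodes, and works through several concrete examples of data analysis.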

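2 Polarization measurement
2.1 Polarization measurement using an MEA (single cell)
To perform polarization measurements on an MEA, a single cell must be assembled and set up so that it can generate power. Fig. 1 shows a schematic of a typical MEA structure.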

f:id:AI_ML_DL:20210812104150p:plain

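(I have seen this schematic many times, but had never paid attention to the scale. It is only 1 mm thick.)

Polarization curves are sometimes measured by sweeping the cell voltage at a very slow scan rate, but the steady-state method is generally used: the cell is held at a given current density for a fixed time, the resulting cell voltage is recorded, and this is repeated stepwise from low to high current density. This is because changing the current density alters the distributions of gas, water, current, and so on within the MEA, and it takes about 5 to 10 minutes for these to settle to a steady state.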

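Fig. 2 shows a conceptual diagram of the steady-state polarization curve (current-voltage curve) of a PEFC single cell. The cell voltage $E$ (V) at a load current $i$ (A) can be expressed as
$$E = E_0 - \eta_a - \eta_c - \eta_{\mathrm{diff}} - i \cdot R \tag{1}$$
where $E_0$ is the theoretical electromotive force, $\eta_a$ is the anode activation polarization, $\eta_c$ is the cathode activation polarization, $\eta_{\mathrm{diff}}$ is the concentration polarization due to mass transport, and $i \cdot R$ is the resistance polarization (the product of the current and the internal cell resistance).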
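To see how the terms of Eq. (1) shape a polarization curve, here is a minimal sketch that evaluates the equation over a range of currents. Eq. (1) itself is from the article, but the functional forms used for the individual terms are assumptions: a Tafel-type expression for the cathode activation polarization, a logarithmic limiting-current expression for the concentration polarization, and a constant (here zero) anode term. All parameter values are hypothetical.

```python
import math


def cell_voltage(i, e0=1.23, eta_a=0.0, tafel_slope=0.07, i0=1e-3,
                 c_diff=0.03, i_lim=2.0, r=0.1):
    """Evaluate Eq. (1), E = E0 - eta_a - eta_c - eta_diff - i*R, for i0 < i < i_lim.

    Assumed forms (not from the article):
      eta_c    = tafel_slope * log10(i / i0)      (Tafel-type, in V)
      eta_diff = -c_diff * ln(1 - i / i_lim)      (limiting-current type, in V)
    """
    if not i0 < i < i_lim:
        raise ValueError("current must lie between i0 and i_lim")
    eta_c = tafel_slope * math.log10(i / i0)
    eta_diff = -c_diff * math.log(1.0 - i / i_lim)
    return e0 - eta_a - eta_c - eta_diff - i * r


currents = [0.01, 0.05, 0.1, 0.5, 1.0, 1.5, 1.9]   # A, hypothetical load points
voltages = [cell_voltage(i) for i in currents]

for i, v in zip(currents, voltages):
    print(f"i = {i:4.2f} A  ->  E = {v:.3f} V")
```

With these assumed forms, the activation term dominates the drop at low current, the resistance term gives the roughly linear middle region, and the concentration term causes the sharp fall near the limiting current.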

f:id:AI_ML_DL:20210812105952p:plain

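If the fuel is pure hydrogen and the anode is a platinum catalyst, the anode activation polarization is very small, so the activation polarization can be regarded as arising almost entirely from the cathode.

I looked into why this cathode activation overpotential is large.

Shohji Tsushima and co-authors write the following in a review article titled "Principles and Features of Fuel Cells" in the Journal of High Temperature Society (高温学会誌), Vol. 35, No. 5 (September 2009).

In a PEMFC, electron transfer is easier in the anode reaction than in the cathode reaction, and the activation overpotential at the anode is in most cases negligibly small. At the cathode, on the other hand, electrons in the platinum supplied from the anode must pass through elementary steps and finally be transferred into the water molecule that is the product. The elementary steps of this electrochemical reaction can hardly be said to be fully understood; for example, early in the reaction one can consider steps such as adsorption of an oxygen molecule on platinum, formation of adsorbed OH by bonding of an oxygen atom with a proton, similar formation of OOH, and subsequent desorption as a water molecule following electron transfer from the platinum side. Even though the elementary steps are not fully clear, an activation overpotential is needed to drive electron transfer for the cathode reaction to proceed; in a PEMFC in particular it is known to be larger than at the anode and to be a major cause of energy loss.

Back to: Polarization Curves and Cyclic Voltammetry (2): Fuel Cells (PEFC)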

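2.2 Polarization measurement using a half cell
Polarization measurement with an MEA is a practical method, but the preparation and procedure are cumbersome, and it is affected by various factors such as MEA fabrication and power generation conditions. For example, even when the intent is to evaluate a catalyst material, there are cases where the measurement does not simply characterize the catalyst itself. In contrast, evaluation using a half cell such as a rotating disk electrode (RDE) is comparatively simple and readily reproducible, and is often used for evaluating catalyst activity. Fig. 3 shows the apparatus for half-cell measurement using an RDE. For measurements at high temperature or under pressure, which are difficult with an RDE, methods such as the channel flow electrode are also used 4).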

f:id:AI_ML_DL:20210812162555p:plain

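Fig. 4(a) shows hydrodynamic voltammograms of the oxygen reduction reaction on platinum-supported carbon (Pt/C catalyst) fixed on a glassy carbon electrode. As the electrode potential is lowered, the oxygen reduction current $i$ begins to flow at E < 1 V, and it becomes constant after reaching the diffusion-limited current $i_L$, which depends on the electrode rotation rate. The current before the diffusion limit is reached is under mixed control by convective diffusion and reaction activation, and is expressed by the following relation.
$$\frac{1}{i} = \frac{1}{i_k} + \frac{1}{i_L} \tag{2}$$
Here, $i_k$ is the activation-controlled current with the influence of diffusion removed. Rearranging Eq. (2) gives $i_k$ as follows.
$$i_k = \frac{i \cdot i_L}{i_L - i} \tag{3}$$
Using the activation-controlled current $i_k$ obtained in this way from $i$ and $i_L$, the specific activity $i_s$ and the mass activity $i_m$, which are used as indices of electrode activity, are obtained as follows.

$$i_s\ (\mathrm{mA/cm^2_{Pt}}) = \frac{i_k\ (\mathrm{mA})}{\text{Pt active surface area}\ (\mathrm{cm^2_{Pt}})} \tag{4}$$
$$i_m\ (\mathrm{mA/mg_{Pt}}) = \frac{i_k\ (\mathrm{mA})}{\text{Pt loading}\ (\mathrm{mg_{Pt}})} \tag{5}$$
Here, the active surface area of the catalyst (electrochemically active surface area; ECSA) can be determined using CV, described later.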
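Equations (3) to (5) are easy to check numerically. The following is a minimal sketch; the function names and the example values (current, limiting current, ECSA, Pt loading) are assumptions chosen for illustration, and current magnitudes are used so that sign conventions do not matter.

```python
def kinetic_current(i, i_lim):
    """Eq. (3): i_k = i * i_L / (i_L - i), using current magnitudes."""
    if not 0 < i < i_lim:
        raise ValueError("need 0 < i < i_L; Eq. (3) diverges as i approaches i_L")
    return i * i_lim / (i_lim - i)

def specific_activity(i_k_mA, ecsa_cm2):
    """Eq. (4): kinetic current per unit Pt active surface area (mA/cm2_Pt)."""
    return i_k_mA / ecsa_cm2

def mass_activity(i_k_mA, pt_mg):
    """Eq. (5): kinetic current per unit Pt mass (mA/mg_Pt)."""
    return i_k_mA / pt_mg

# Hypothetical example values (not from the article).
i_k = kinetic_current(i=0.5, i_lim=1.0)        # mA
i_s = specific_activity(i_k, ecsa_cm2=2.0)     # mA/cm2_Pt
i_m = mass_activity(i_k, pt_mg=0.004)          # mA/mg_Pt
print(i_k, i_s, i_m)
```

The guard in `kinetic_current` reflects the limitation the article notes next: the denominator becomes small as the measured current approaches the limiting current, so noise is amplified there.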
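Fig. 4(b) shows the potential dependence of $i_s$ (Tafel plot) obtained from the hydrodynamic voltammograms in Fig. 4(a) using Eqs. (3) and (4). The method using Eq. (3) is simple, but because it takes the difference between $i_L$ and $i$, the influence of experimental error and noise grows as $i$ approaches $i_L$. It also cannot be applied when the electrode activity is low and no diffusion-controlled region is observed ($i_L$ cannot be determined); in that case, hydrodynamic voltammograms must be measured at several electrode rotation rates and analyzed using the Koutecky-Levich plot (Eq. (6)).
$$\frac{1}{i} = \frac{1}{i_k} + \frac{1}{0.62\, n F A C D^{2/3} \nu^{-1/6} \omega^{1/2}} \tag{6}$$
Here, $n$ is the number of electrons transferred, $F$ is the Faraday constant (96485 C mol$^{-1}$), $A$ is the geometric area of the electrode (cm$^2$), $C$ is the concentration of the reacting species (mol cm$^{-3}$), $D$ is the diffusion coefficient of the reacting species (cm$^2$ s$^{-1}$), $\nu$ is the kinematic viscosity of the solution (cm$^2$ s$^{-1}$), and $\omega$ is the angular rotation rate of the electrode (rad s$^{-1}$). In the Koutecky-Levich plot, $1/i$ is plotted against $\omega^{-1/2}$ and the resulting straight line is extrapolated to $\omega^{-1/2} = 0$ to obtain $1/i_k$, so the analysis is possible even when $i_L$ is not clearly defined. For details of the analysis, refer to textbooks 1,5,6).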
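The Koutecky-Levich extrapolation can be sketched as a straight-line fit of 1/i against ω^(-1/2). In the sketch below the data are synthetic, generated from Eq. (6) with assumed values of i_k and of the Levich constant B = 0.62 n F A C D^(2/3) ν^(-1/6); the function names are my own.

```python
import numpy as np

def levich_constant(n, area_cm2, conc_mol_cm3, diff_cm2_s, visc_cm2_s, faraday=96485.0):
    """B in i_L = B * omega**0.5, i.e. the prefactor in Eq. (6) (A s^0.5 rad^-0.5)."""
    return (0.62 * n * faraday * area_cm2 * conc_mol_cm3
            * diff_cm2_s ** (2.0 / 3.0) * visc_cm2_s ** (-1.0 / 6.0))

def koutecky_levich_fit(omega_rad_s, current_A):
    """Fit 1/i vs omega^-1/2; intercept = 1/i_k, slope = 1/B. Returns (i_k, B)."""
    x = np.asarray(omega_rad_s, dtype=float) ** -0.5
    y = 1.0 / np.asarray(current_A, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    return 1.0 / intercept, 1.0 / slope

# Synthetic data from assumed "true" values (not measured data).
rpm = np.array([400.0, 900.0, 1600.0, 2500.0])
omega = 2.0 * np.pi * rpm / 60.0            # rpm -> rad/s
i_k_true, b_true = 5.0e-3, 4.0e-4           # A, A s^0.5 rad^-0.5
current = 1.0 / (1.0 / i_k_true + 1.0 / (b_true * np.sqrt(omega)))

i_k_fit, b_fit = koutecky_levich_fit(omega, current)
print(i_k_fit, b_fit)
```

Because the intercept is read off at ω^(-1/2) = 0, no plateau current is needed, which is the advantage over Eq. (3) mentioned in the text.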

f:id:AI_ML_DL:20210812164639p:plain

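3 Cyclic voltammetry (CV)
One of the main uses of CV in PEFCs is measurement of the active surface area (ECSA). Because the size of the catalyst's active surface area is a factor that governs electrode activity, it is an important index in both initial-activity and degradation analysis. In addition, in PEFC electrodes not all of the loaded catalyst can normally be used, and even with the same catalyst material the fraction of platinum that is actually usable in the catalyst electrode (that is, the platinum utilization) varies with how the catalyst layer is formed and with the operating conditions. CV is an analytical method that can easily measure the active surface area of the catalyst in situ, not only in half cells but also in MEAs. For evaluating the active surface area of platinum catalyst electrodes, the method based on the charge of the hydrogen adsorption/desorption waves is often used. This is because no special gas or equipment is needed for the surface area measurement, hydrogen adsorption on the platinum surface proceeds by an underpotential deposition (UPD) mechanism so that multilayer deposition is unlikely, and clean electrodes give well-defined peaks.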

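3.1 CV measurement using a half cell

As in the polarization measurement, the catalyst is immobilized on a glassy carbon electrode for the measurement. Fig. 5 shows a typical CV of a platinum electrode in 0.1 M perchloric acid solution (oxidation current is plotted as positive).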

f:id:AI_ML_DL:20210812171503p:plain

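Hydrogen adsorption/desorption peaks appear in the potential region less noble than 0.4 V. The hydrogen adsorption charge $Q_H$ is taken as the hatched area in Fig. 5, bounded by the electric double-layer charging current (horizontal line) and the onset of the hydrogen evolution peak (vertical line).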
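The area-to-charge step can be sketched as follows: integrate the current above the double-layer baseline over potential, divide by the scan rate to get charge, then convert charge to area. The conversion factor of 210 µC per cm² of Pt is a commonly used value for polycrystalline platinum and is an assumption here, since the conversion equation is not reproduced in these notes; the CV data are synthetic and the function names are my own.

```python
import numpy as np

Q_H_REF_UC_PER_CM2 = 210.0  # assumed conversion factor (µC per cm2 of Pt)

def hupd_charge_uC(potential_V, current_A, i_dl_A, scan_rate_V_s):
    """Charge Q_H (µC): area between the CV current and the double-layer baseline."""
    e = np.asarray(potential_V, dtype=float)
    i = np.asarray(current_A, dtype=float) - i_dl_A
    area_AV = np.sum(0.5 * (i[1:] + i[:-1]) * np.diff(e))  # trapezoidal rule, A*V
    return area_AV / scan_rate_V_s * 1.0e6                  # A*V / (V/s) = C -> µC

def ecsa_cm2(q_h_uC):
    """Electrochemically active surface area (cm2_Pt) from Q_H."""
    return q_h_uC / Q_H_REF_UC_PER_CM2

# Synthetic hydrogen-region segment (not measured data): flat 0.12 mA over 0.05-0.40 V.
potential = np.linspace(0.05, 0.40, 36)
current = np.full_like(potential, 1.2e-4)
q_h = hupd_charge_uC(potential, current, i_dl_A=2.0e-5, scan_rate_V_s=0.05)
area = ecsa_cm2(q_h)
print(q_h, area)
```

In a real CV the lower integration limit (the onset of hydrogen evolution) has to be chosen by eye or by a rule, and that choice is one source of the errors discussed for MEA measurements.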

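3.2 CV measurement using an MEA (single cell)
To determine the active surface area that can contribute to power generation in a single cell, a single cell is assembled using an MEA and CV is measured. The precautions for the measurement are the same as for the half cell. The cell is held near room temperature (or at the power generation temperature), and humidified nitrogen gas is passed over the test and counter electrodes. After the air in both electrodes of the test cell has been completely replaced, humidified hydrogen gas is passed over the counter electrode side to create a hydrogen atmosphere.
In CV measurement on an MEA, which uses gas diffusion electrodes, hydrogen evolution and the oxidation current of the evolved hydrogen gas often begin to flow at a considerably higher potential (~0.1 V) than in the half cell, and because they overlap the hydrogen adsorption/desorption wave current they readily cause errors in calculating $Q_H$. This potential shift of hydrogen evolution is reported to arise because the hydrogen partial pressure near the working electrode catalyst falls 10), and reducing or stopping the flow of nitrogen purge gas to the test electrode during CV measurement is considered effective in preventing it 7,10).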

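Fig. 6 shows the CV of a Pt/C catalyst electrode. Apart from the whole pattern being shifted toward the oxidation current side, the CV is basically the same as that measured in aqueous electrolyte. This CV shift is due to hydrogen gas crossover. In systems with thin fluorinated electrolyte membranes in particular, the amount of crossover hydrogen is large (equivalent to a current of ~1 mA/cm2), and the oxidation current of crossover hydrogen superimposed on the CV current is correspondingly large. From the ratio of the effective active surface area of the electrode determined in the MEA ($SA_{\mathrm{MEA}}$) to the active surface area intrinsic to the catalyst material determined by half-cell measurement or similar ($SA_{\mathrm{cat}}$), the platinum utilization $u_{\mathrm{Pt}}$ can be determined.
$$u_{\mathrm{Pt}} = \frac{SA_{\mathrm{MEA}}}{SA_{\mathrm{cat}}} \tag{8}$$
As another way of determining $u_{\mathrm{Pt}}$, a technique using carbon monoxide (CO) stripping / CO adsorption, described later, has also been proposed 11).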

f:id:AI_ML_DL:20210812211515p:plain

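3.3 CV application example 1: accelerated catalyst degradation and analysis
Improving PEFC durability is one of the key issues for commercialization 12), and suppressing catalyst degradation, one of the degradation factors, is required. Under normal PEFC operating conditions catalyst degradation often proceeds slowly, so establishing appropriate accelerated degradation methods and ways to evaluate them is important for speeding up materials and system development. For this purpose, accelerated catalyst degradation evaluation methods that apply potential cycling such as CV to half cells and MEAs have been proposed by the Fuel Cell Commercialization Conference of Japan (FCCJ) (although the evaluation methods may be revised in the future on the basis of new findings) 13).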

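The main causes of degradation of Pt/C catalysts (especially at the cathode) are considered to be dissolution and agglomeration of Pt fine particles and degradation of the catalyst support. Pt dissolution is known to be accelerated by repeated oxidation and reduction of Pt 14), and such an environment corresponds closely to the cathode-side conditions during load cycling, in which the cell alternates between OCV and the loaded state. Fig. 7(a) shows the load cycling test conditions for MEAs proposed by FCCJ. This test evaluates catalyst stability under conditions where Pt dissolution is accelerated, by cycling the potential between 0.6 V and 0.9 V in a nitrogen atmosphere to repeatedly oxidize and reduce Pt.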

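On the other hand, degradation of catalyst supports such as carbon black is known to be accelerated at high potentials above 1 V 15). Under normal conditions fuel cell electrodes are not exposed to such high potentials, but during start-up and shutdown, for example, the cathode potential can reach up to 1.5 V through what is called the reverse-current mechanism 16). Test conditions such as those in Fig. 7(b) have been proposed as a start/stop test that simulates this state. Resistance to the abnormal potentials during start-up and shutdown is evaluated by repeating square-wave cycles between 0.9 V and 1.3 V in a nitrogen atmosphere.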

f:id:AI_ML_DL:20210812213319p:plain

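Fig. 8 shows an example in which the start/stop test of Fig. 7(b) was run for 2000 cycles on the Pt/C cathode catalyst electrode of an MEA. The hydrogen adsorption/desorption waves and the oxide layer formation/reduction peaks shrink as cycling proceeds, which suggests that Pt dissolution and agglomeration are progressing. At the same time the electric double-layer current gradually increases, and a peak pair near 0.5 - 0.6 V, attributed to the redox of functional groups on the carbon surface, becomes progressively clearer, which suggests that oxidation of the carbon support surface is also progressing. Fig. 8(b) plots the ECSA obtained from the hydrogen adsorption/desorption waves of the CV, normalized by its initial value, against the number of cycles. Note that because the CVs in Fig. 8 were measured at 80 °C, the decrease in hydrogen coverage and the effect of hydrogen evolution cannot be neglected, and accurate ECSA evaluation is difficult. It is nevertheless possible to discuss relative changes with cycling, as in Fig. 8(b). For accelerated degradation tests using half cells, repeated triangular-wave CV has been proposed for both the load cycling and start/stop tests 13). Fig. 9 shows an example in which a half-cell start/stop test (triangular-wave CV, 1.0 V/1.4 V) was run for 10000 cycles on a Pt/C catalyst. As with the MEA results in Fig. 8, a decrease in ECSA accompanied by an increase in double-layer capacitance is confirmed.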

f:id:AI_ML_DL:20210812221632p:plain

f:id:AI_ML_DL:20210812221717p:plain

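August 13 (Fri)

Analysis of the Structure and Proton Transport of Ionomer Thin Films Using MD Simulation

Koichi Kobayashi et al., 燃料電池 (Nenryo Denchi), Vol. 18, No. 4, 2019

Summary: Polymer electrolyte fuel cells (PEFCs) have been actively researched and developed as power sources for vehicles and stationary use. During power generation, various species such as protons, hydrogen, and oxygen are transported inside the PEFC, so improving PEFC performance requires understanding the internal transport phenomena. This study analyzed the effect of ionomer film thickness on the film structure and proton transport properties in the ionomer thin film within the catalyst layer. Molecular dynamics simulation was used to evaluate nanoscale structure and transport. The results show that the ionomer film thickness affects the distribution of water molecules inside the film, and that at an ionomer film thickness of about 7 nm the connectivity of the water clusters is highest and the proton self-diffusion coefficient is also high.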

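1. Introduction
Polymer electrolyte fuel cells (PEFCs) are expected to perform well, particularly as vehicle and household power sources, as Japan steers toward a hydrogen society, and are being actively developed. Widespread adoption of PEFCs requires higher power density per unit cell 1). This in turn requires clarifying the correlation between structure and transport at the molecular level in the membrane electrode assembly (MEA), and in the catalyst layer in particular, structural optimization that takes into account gas diffusivity and proton/electron conductivity is needed 2).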

・・・・・・・・・・・・・・・・・・・・

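2. Computational method
In this study, a simulation system was built in which an ionomer thin film is adsorbed on a carbon wall, and the effect of ionomer film thickness on film structure and proton transport properties was analyzed using molecular dynamics simulation. The composition of the system is described below. A model of Nafion® was used for the ionomer thin film. The equivalent weight (EW) of the Nafion® used is 1146, and its molecular structure is shown in Fig. 1. To reduce the computational load while maintaining the accuracy of the analysis, a united atom (UA) model 14), which treats each CFn group as a single atom, was used. The ionomer film thickness was controlled by varying the number of Nafion® chains in the system. The film thickness varies by about ±1 nm depending on the water content and other factors; Table 1 summarizes the number of Nafion® chains in the system and the approximate film thickness.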

・・・・・・・・・・・・・・・・・・・・

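3. Results and discussion
3.1 Maximum cluster length
First, cluster size was analyzed to examine the correlation between ionomer film thickness and the water cluster structure in the film. In this calculation, a water cluster is defined as an aggregate in which the nearest-neighbor distance between the oxygen atoms of water molecules and hydronium ions is within 3.3 Å. This 3.3 Å distance is the end of the first peak in the RDF between oxygen atoms of water molecules in bulk Nafion® membrane 20). In this study the number of water molecules in a cluster is defined as the cluster size. Fig. 4 shows the average cluster size as the film thickness is varied at λ = 3 and 14. At λ = 14 the cluster size increases with film thickness, whereas at λ = 3 it shows no large change. In general the proton self-diffusion coefficient increases as clusters in the membrane grow, but in a thin film, cluster growth in the wall-normal direction is thought to have little effect on the self-diffusion coefficient in the horizontal direction.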
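The distance-based cluster definition above can be sketched with a small union-find: two oxygen atoms belong to the same cluster if they are within the 3.3 Å cutoff, directly or through a chain of neighbors. This is a simplified illustration of the idea, not the authors' code; it ignores periodic boundary conditions, and the coordinates are made up.

```python
import numpy as np

def cluster_sizes(oxygen_xyz, cutoff=3.3):
    """Sizes of clusters of O atoms linked by distances <= cutoff (Å), largest first."""
    xyz = np.asarray(oxygen_xyz, dtype=float)
    n = len(xyz)
    parent = list(range(n))

    def find(a):
        # Follow parent links to the root, compressing the path on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        dist = np.linalg.norm(xyz[i + 1:] - xyz[i], axis=1)
        for j in np.nonzero(dist <= cutoff)[0]:
            ra, rb = find(i), find(i + 1 + int(j))
            if ra != rb:
                parent[rb] = ra  # merge the two clusters

    roots = [find(i) for i in range(n)]
    _, counts = np.unique(roots, return_counts=True)
    return sorted(counts.tolist(), reverse=True)

# Made-up coordinates (Å): three O atoms in a chain plus one isolated atom.
coords = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [6.0, 0.0, 0.0], [20.0, 0.0, 0.0]]
sizes = cluster_sizes(coords)
print(sizes)
```

Note that the first and third atoms are 6 Å apart, yet they fall into one cluster because both are within the cutoff of the middle atom.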

・・・・・・・・・・・・・・・・・・・・

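3.2 Density distribution
In the ionomer thin film on a carbon wall treated in this study, the density in the wall-normal direction is expected to be non-uniform because of the influence of the carbon wall and the interface. To obtain the density distribution of such a system, the system was divided into small cells of x × y × z = 1.02 × 0.88 × 1.00 Å³ and the density ρ_local was computed for each cell. Averaging ρ_local over the wall-parallel directions then gave the density distribution in the wall-normal direction.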
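For cells of equal size, averaging the cell densities over the wall-parallel directions gives the same result as binning the mass into slabs along z and dividing by the slab volume. The sketch below uses that shortcut; it is an assumption-laden illustration with made-up inputs and my own function name, not the procedure from the paper.

```python
import numpy as np

def density_profile_z(z, masses, lx, ly, dz, z_max):
    """Mass density in slabs of thickness dz along z (mass per unit volume)."""
    n_bins = int(round(z_max / dz))
    edges = np.arange(n_bins + 1) * dz
    mass_per_slab, _ = np.histogram(z, bins=edges, weights=masses)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, mass_per_slab / (lx * ly * dz)

# Made-up example: three unit-mass atoms in a 2 x 5 box (arbitrary units).
z = [0.5, 0.6, 2.5]
masses = [1.0, 1.0, 1.0]
centers, rho = density_profile_z(z, masses, lx=2.0, ly=5.0, dz=1.0, z_max=3.0)
print(centers, rho)
```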

・・・・・・・・・・・・・・・・・・・・

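3.3 Proton self-diffusion coefficient
Finally, the proton self-diffusion coefficient was calculated to analyze the relationship between proton transport properties and ionomer film thickness. The diffusion coefficient was calculated from the mean square displacement (MSD) using the Einstein relation. The MSD is given by Eqs. (2) and (3) and the Einstein relation by Eq. (4). The self-diffusion coefficient was restricted to the direction parallel to the carbon wall, as shown in Fig. 2. This is because, considering proton transport from the electrolyte membrane to the catalyst, transport parallel to the carbon wall accounts for most of it.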
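The equations of the paper are not reproduced in these notes, so the following sketch assumes the standard form of the Einstein relation in two dimensions, MSD(t) ≈ 4Dt, applied to the wall-parallel (x, y) displacements only. It uses a single time origin for simplicity and a synthetic trajectory constructed so that the answer is known; all names and numbers are illustrative assumptions.

```python
import numpy as np

def lateral_msd(positions):
    """MSD in the x-y plane relative to frame 0; positions has shape (frames, particles, 3)."""
    r = np.asarray(positions, dtype=float)
    disp = r[:, :, :2] - r[0, :, :2]          # ignore z (wall-normal) motion
    return np.mean(np.sum(disp ** 2, axis=2), axis=1)

def lateral_diffusion_coefficient(times, msd):
    """Einstein relation in 2D: D = slope of MSD(t) / 4."""
    slope, _ = np.polyfit(times, msd, 1)
    return slope / 4.0

# Synthetic unwrapped trajectory with MSD = 4*D*t by construction (not simulation data).
times = np.arange(11, dtype=float)
d_true = 0.5
positions = np.zeros((11, 1, 3))
positions[:, 0, 0] = np.sqrt(4.0 * d_true * times)
positions[:, 0, 2] = 100.0 * times            # large z motion, deliberately ignored

msd = lateral_msd(positions)
d_fit = lateral_diffusion_coefficient(times, msd)
print(d_fit)
```

In practice the MSD would be averaged over many time origins and fitted only in the linear (diffusive) regime, and the trajectory must be unwrapped across periodic boundaries first.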

・・・・・・・・・・・・・・・・・・・・

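4. Conclusions
In this study, MD simulation was used to analyze proton transport properties in PEFC ionomer thin films. The aTS-EVB model was used as the proton diffusion model, so that the analysis of proton transport properties takes into account diffusion by the Grotthuss mechanism. As the model of the ionomer thin film, a coarse-grained Nafion® model was adsorbed on an LJ wall mimicking a carbon wall with a contact angle of 90°, and changes in proton transport properties and film structure were analyzed by varying the ionomer film thickness. First, at λ = 14 (corresponding to the water content of bulk Nafion membrane at RH = 100%), the proton self-diffusion coefficient and the cluster connectivity were found to have little dependence on film thickness. The cluster length was almost equal to the size of the computational domain, suggesting that the clusters are fully connected at high water content. Furthermore, the clusters grow in the wall-normal direction as the film thickness increases, which is considered one reason why the thickness dependence of the proton self-diffusion coefficient is small at λ = 14.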

・・・・・・・・・・・・・・・・・・・・

There is an interesting-looking paper. Let me take a quick look. It seems some remarkable results have been obtained!

High Pressure Nitrogen-Infused Ultrastable Fuel Cell Catalyst for Oxygen Reduction Reaction, Eunjik Lee et al., ACS Catal., 11, 5525−5531 (2021)

ABSTRACT:

The mass activity of a Pt-based catalyst can be sustained throughout the fuel cell vehicle life by optimizing its stability under the conditions of an oxygen reduction reaction (ORR) that drives the cells. Here, we demonstrate improvement in the stability of a readily available PtCo core−shell nanoparticle catalyst over 1 million cycles by maintaining its electrochemical surface area by regulating the amount of nitrogen doped into the nanoparticles. The high pressure nitrogen-infused PtCo/C catalyst exhibited a 2-fold increase in mass activity and a 5-fold increase in durability compared with commercial Pt/C, exhibiting a retention of 80% of the initial mass activity after 180 000 cycles and maintaining the core−shell structure even after 1 000 000 cycles of accelerated stress tests. Synchrotron studies coupled with pair distribution function analysis reveal that inducing a higher amount of nitrogen in core−shell nanoparticles increases the catalyst durability.

Ptベースの触媒の質量活性は、セルを駆動する酸素還元反応（ORR）の条件下でその安定性を最適化することにより、燃料電池車の寿命全体にわたって維持できます。ここでは、ナノ粒子にドープされた窒素の量を調整することによってその電気化学的表面積を維持することにより、100万サイクルにわたって容易に入手可能なPtCoコアシェルナノ粒子触媒の安定性の改善を示します。高圧窒素注入PtCo / C触媒は、市販のPt / Cと比較して質量活性が2倍に増加し、耐久性が5倍に増加し、18万サイクル後に初期質量活性の80％の保持を示しました。 1 000000サイクルの加速応力試験後もコアシェル構造を維持します。ペア分布関数分析と組み合わせたシンクロトロン研究は、コアシェルナノ粒子に大量の窒素をドープすると触媒の耐久性が向上することを明らかにしています。

f:id:AI_ML_DL:20210813220657p:plain

INTRODUCTION
Extensive practical applications of the commercial hydrogen fuel cell vehicle have been delayed because of the high cost and limited durability of the membrane electrode assembly (MEA).

One of the main reasons for the high cost of the MEA is the large amount of Pt used to catalyze the oxygen reduction reaction (ORR) at the cathode of the proton exchange membrane (PEM) fuel cell.

In the past decade, several studies investigated ORR electrocatalysts to reduce the cost of the MEA.

One of the main strategies is to add modifiers to the Pt catalyst by changing the structure and morphology of the PtM (metal) alloy catalyst, while others include completely avoiding Pt usage by using various nonprecious M−N−C moiety catalysts.

Although the addition of modifiers can drastically increase catalytic performance, it cannot be sustained for prolonged periods, which is a major factor impeding commercialization.

To date, carbon-supported PtCo alloy nanoparticles have emerged as the best alternative to Pt/C; original equipment manufacturers are already using them in first-generation hydrogen fuel cell vehicles.

For better Pt utilization efficiency throughout the fuel cell lifetime, an ideal catalyst should be able to maintain its electrochemical surface area (ECSA).

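Although earlier studies have corroborated nitrogen’s role in stabilizing the catalyst, high-pressure doping of nitrogen in a controlled environment on industrial scale core−shell nanoparticles was not achieved.

The paper says that something attempted earlier without success was achieved this time, so I looked up the earlier studies; they appear to be from the same research group, which was reassuring. They are the following two papers.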

(25) Kuttiyiel, K. A.; Sasaki, K.; Choi, Y.; Su, D.; Liu, P.; Adzic, R. R. Nitride Stabilized PtNi Core−Shell Nanocatalyst for high Oxygen Reduction Activity. Nano Lett. 2012, 12 (12), 6266−6271.
(26) Kuttiyiel, K. A.; Choi, Y.; Hwang, S.-M.; Park, G.-G.; Yang, T.- H.; Su, D.; Sasaki, K.; Liu, P.; Adzic, R. R. Enhancement of the oxygen reduction on nitride stabilized Pt-M (M = Fe, Co, and Ni) core−shell nanoparticle electrocatalysts. Nano Energy 2015, 13, 442−449.

Thus, in this study, to obtain a highly stable and active ORR catalyst, a high-pressure nitriding reactor that can infuse a controlled number of nitrogen (N) atoms into the alloy nanoparticles was developed.

Varying the ratio of N atoms in the PtCo/C core−shell nanoparticles can significantly affect the morphology of the nanoparticles and simultaneously increase their stability without impacting the activity.

Herein, we report the preparation of N-stabilized PtCo core−shell nanoparticles with ultrastable configurations; the result is a highly durable ORR catalyst that can withstand up to 1 000 000 cycles in accelerated stress tests (ASTs), enabling rapid commercialization of fuel cell vehicles.

To the best of our knowledge, thus far, no catalysts have been reported that can last 1 million cycles.

The best configuration (Pt40Co36N24/C) retained 93% of its ECSA, while its initial half-wave potential decreased by only 6 mV after 30 000 cycles.

This confirms that the proposed configuration is a suitable alternative to the commercial Pt/C catalyst, whose ECSA deteriorated by 40% under similar conditions.

CONCLUSION
We exhibited that nanostructured core−shell materials with high contents of N in their cores can be engineered to sustain harsh and oxidative electrochemical environments during fuel cell operation.

X-ray experiments and PDF analyses revealed that a high N content could protect the Co core against dissolution.

The sustainment of 1 million cycles after harsh and corrosive ASTs without significant dissolution facilitates the potential industrial scale application of the catalysts.

This strategy presents a promising approach to develop cheap and ultradurable core−shell catalysts using other 3d transition metal cores.

August 14 (Sat)

High Pressure Nitrogen-Infused Ultrastable Fuel Cell Catalyst for Oxygen Reduction Reaction, Eunjik Lee et al., ACS Catal., 11, 5525−5531 (2021)

RESULTS AND DISCUSSION
Carbon-supported PtCo core−shell nanoparticles were prepared by reducing platinum acetylacetonate [Pt(acac)2] and cobalt acetylacetonate [Co(acac)2] via ultrasound-assisted polyol synthesis.

Transmission electron microscopy (TEM) analysis showed that the as-synthesized PtCo nanoparticles exhibited a core−shell structure with an average particle size of ∼2.3 nm (Figure S1).

f:id:AI_ML_DL:20210814101356p:plain
Scanning TEM (STEM) and energy dispersive X-ray spectroscopy (EDS) confirmed the core−shell structure with 1−2 Pt monolayers on the Co-rich core (Figure 1B−D).

f:id:AI_ML_DL:20210814101734p:plain

The PtCo core−shell nanoparticles were annealed in an argon/ammonia mixture (N2/NH3: 5/95) at 510 °C in three pressurized environments (1, 40, and 80 bar).

The nanoparticles maintained their core−shell structures and exhibited an increase in the particle size and a change in composition (Figure 1F−H).

f:id:AI_ML_DL:20210814101950p:plain
As shown in Figure 1E, higher pressure increases the N content in the nanoparticles but ultimately decreases the particle size.

f:id:AI_ML_DL:20210814102601p:plain
On the basis of the N content in the nanoparticles, the molar ratio changes drastically; the resultant nanoparticles are denoted as Pt52Co48/C, Pt53Co45N2/C, Pt44Co42N14/C, and Pt40Co36N24/C (Table 1).

f:id:AI_ML_DL:20210814102840p:plain
For all samples, in-house X-ray diffraction (XRD) patterns exhibit the typical face-centered cubic (fcc) structure, with no phase segregation, corresponding to Pt and its alloys with transition metals (JCPDS, No. 87-0646) (Figure 1A).

f:id:AI_ML_DL:20210814103604p:plain
The position of the (111) peak of PtCo/C shifts to a higher angle compared with that of Pt/C, indicating that Co atoms with relatively smaller atomic sizes are incorporated into the Pt lattice, causing compressive strain.

Interestingly, the nitriding pressure directly affects the full width at half-maximum (fwhm) and position of the (111) peak.

In particular, the fwhm increases and the (111) peak position gradually shifts to a lower angle with an increase in the nitriding pressure.

This suggests that the nitriding pressure changes the atomic structure of the catalyst particles while relaxing the lattice mismatch between Pt skin and cobalt nitride core (Table 1).

Furthermore, X-ray photoelectron spectroscopy (XPS) studies indicate that, compared with metallic Pt, the Pt 4f peak in all samples shifts to a lower binding energy (BE), likely owing to the charge transfer from Co to Pt (Figure S2).

f:id:AI_ML_DL:20210814103946p:plain
Additionally, no peaks (∼399.8 eV) for imides/lactams/amides are observed, indicating that most N in the samples exists in the form of nitrides.

To gain further insights about how the as-synthesized PtCo core−shell nanoparticles maintain their structures while incorporating N atoms, we carried out ab initio molecular dynamics (AIMD) studies to simulate the formation of the CoN nanophase in the nanoparticle core.

Before the conduction of AIMD, the NH3 molecules were packed into a unit cell with cuboctahedral PtCo nanoparticles under pressures of 1, 10, and 45 bar by use of the COMPASSII force field.

We considered the entropic effect to identify the continuous reaction process incorporated at a finite temperature of 783 K.

In the case of a single PtCo nanoparticle, it is found that N atoms from the NH3 molecules cannot penetrate the Co core even at a high pressure of NH3, as shown in Figure S3 and Movie S1.

Therefore, we tested the case of formation of PtCoN core−shell nanoparticles through a particle growth process involving the agglomeration of the preformed PtCo fragments into nitride cores that are consequently covered by a Pt shell.

The results shown in Figure 2A indicate that this is the likely mechanism of the particle size increasing from ∼2.3 nm for pure PtCo nanoparticles to ∼4.2 nm for Pt53Co45N2/C (Table 1).

f:id:AI_ML_DL:20210814114211p:plain
Interestingly, AIMD studies are appreciably consistent with the observation that two Pt12Co1 nanoparticles at 10 bar of NH3 (e.g., 28.7 bar at 783 K) can spontaneously merge without any considerable activation barrier.

The simulations indicate the formation of irregular particles with a compressed Pt−Pt distance depending on the location of nearby N atoms, as revealed by the atomic pair distribution function (PDF) analysis and the reverse Monte Carlo modeling (discussed below), thereby increasing the number of N atoms that exist near the Pt sublayer.

In situ Co K edge X-ray absorption near-edge structure (XANES) spectra of Pt52Co48/C, Pt53Co45N2/C, Pt44Co42N14/C, and Pt40Co36N24/C nanoparticles (Figure 2B) were obtained in 0.1 M HClO4 at a potential of 0.42 V.

f:id:AI_ML_DL:20210814114409p:plain
As the N concentration increases, the peak intensity at 7724 eV starts decreasing; the highest peak at 7727 eV is observed at a N concentration of >14 at%.

This change can be ascribed to a change in the electronic structures of Co due to N doping.

As shown in Figure S7, the XANES spectra of CoO (Co2+) and Co3O4 (Co2.67+) exhibit the highest peaks at 7725 and 7729 eV, respectively; meanwhile, the highest peak for Pt40Co36N24/C lies between them.

f:id:AI_ML_DL:20210814115232p:plain
Thus, the N doping of PtCo catalysts alters the electronic state of Co, resulting in an increase in the oxidation state.

The increase in the oxidation state with an increase in the N content is also supported by the data shown in the inset of Figure 2B; half-step energy values (at 0.5 of the normalized absorption in the XANES spectra) increase with an increase in the N concentration.

Figure 2C shows the in situ Pt L3 edge XANES spectra of the PtCo/C and N−PtCo/C catalysts measured in 0.1 M HClO4 at a potential of 0.42 V.

f:id:AI_ML_DL:20210814115716p:plain
The intensities of the white lines (first peaks in XANES data) change with the variation in the N content in the N−PtCo/C catalysts.

As shown in the inset of Figure 2C, the intensity increases with increase in N concentration; it is higher than that of a Pt foil but lower than that of the PtCo/C catalyst.

The change in white line intensity is related to the d-band structure in Pt. It is well-known that higher intensities correspond to an increase in d-band vacancy; that in turn lowers the adsorption of the intermediate molecules (such as OOH and OH) on the Pt surface.

Thus, N doping can weaken the interaction of the Pt surface with oxygen, compared with that of bulk Pt.

However, the effect is not as strong as that for the PtCo/C catalyst as the white line intensity for the N−PtCo/C catalysts is lower than that of PtCo/C and varies with the N content.

The XANES data suggest that N doping in N−PtCo/C alters the electronic states of Co and Pt, resulting in moderate adsorption strength of oxygen on the Pt surface.

To comprehensively understand the particle structure, high-energy synchrotron XRD experiments coupled with atomic PDF analysis were carried out.

Experimental PDFs (Figure S8) were fit with 3D models for the nanoparticles using classical molecular dynamics (MD) simulations and were further refined against the experimental PDF data by employing reverse Monte Carlo modeling.

Cross sections of the models emphasizing the core−shell characteristics of the particles are shown in Figure 3.

The models exhibit a distorted fcc-type structure and reproduce the experimental data in exceedingly good detail (Figure S8).

The bonding distances between the surface Pt atoms and surface Pt coordination numbers extracted from the models are also shown in Figure 3.

As observed, PtCo core−shell particles exhibit large structural distortions (∼1.8%).

The surface Pt−Pt distance in Pt53Co45N2 is 2.739 Å, which is approximately 1.5% shorter than the surface Pt−Pt distances in bulk Pt (2.765 Å).

Furthermore, the surface Pt−Pt distance in PtCo is 2.731 Å, indicating 0.3% more strain compared with the strain observed in the Pt53Co45N2 particles.

This indicates that N relaxes the compressive stress in PtCo core−shell particles.

Moreover, the average surface Pt coordination number for the particles with CoN cores increases and becomes more evenly distributed than in the case of pure Pt particles; that is, the surfaces of N-treated particles appear less rough (fewer undercoordinated sharp edges and corners), which can affect the binding strength of oxygen molecules to the particle surface and accelerate the ORR kinetics.

As expected, the N-treated particles show an increased number of N atoms located near the Pt shell, which explains the increased stability of the nanoparticles compared with those of pure Pt and PtCo particles.

The electrochemical performances of all the catalysts were compared using cyclic voltammetry (CV) curves (Figure S4).

The incorporation of Co into the Pt nanoparticles increases the ECSAs of the catalysts, while that of N into the PtCo nanoparticles decreases their ECSAs (Figure 4A).

f:id:AI_ML_DL:20210814184219p:plain
A slightly different trend was observed with respect to the specific and mass activities of the catalyst (Figure 4B).

f:id:AI_ML_DL:20210814184317p:plain
The PtCo/C catalyst with low nitrogen content shows the highest activity among the catalysts; however, an increase in N content does not drastically change its catalytic behavior.

Our study was mainly focused on achieving structural stability of the catalyst.

AST cycles at 0.6−0.95 V and 3 s hold were employed for each catalyst.

All N-infused PtCo/C catalysts showed higher stability and activity compared with commercially available Pt/C and PtCo/C catalysts (Figure S5).

f:id:AI_ML_DL:20210814184759p:plain
The catalyst with the highest N amount (Pt40Co36N24/C) retained 93% of its ECSA, with a decrease of only 6 mV in its initial half-wave potential after 30 000 cycles.

To further investigate the structural integrity of all the catalysts, we cycled them until the ORR activity decreased to half its initial value.

As observed in Figure 4C, most of the N-infused catalysts retained their structures up to 230 000 cycles; however, the catalyst with the highest amount of N (Pt40Co36N24/C) retained its structural integrity until 1 000 000 cycles and lost just 44 mV from its initial half-wave potential (Figure S6).

f:id:AI_ML_DL:20210814185011p:plain

f:id:AI_ML_DL:20210814185125p:plain
Fuel cell (25 cm2) performance tests with 0.1 mg cm−2 Pt content showed promising results (Figure 4D,E).

The Pt40Co36N24/C catalyst achieves the U.S. Department of Energy durability target of a 30 mV voltage drop at 0.8 A cm−2 after 30 000 ASTs (Figure 4H).

Moreover, considering the particle size growth after 30 000 ASTs, the PtCo nanoparticles grew by 41% from their initial average size (Figure 4F), whereas the N-infused PtCo nanoparticles grew by 21%, confirming that N plays a key role in impeding nanoparticle coarsening (Figure 4G).

As previously reported, DFT-based studies clearly support the higher ORR activities of nitride-stabilized Pt−metal electrocatalysts over Pt/C catalysts.

Their volcano-like trends show that the interactions of Pt/C and PtCo/C with oxygen are significantly stronger and weaker, respectively, compared with those of PtCoN/C.

The outstanding stability of high-pressure N-infused PtCoN/C catalysts can be easily explained on the basis of our recent DFT findings.

The segregation effect of Pt facilitated by the higher N concentration in turn facilitates the diffusion of Pt atoms to the vacant sites of the outermost shell, preventing dissolution.

Evidently, these results demonstrate the enhanced catalytic stability of the Pt40Co36N24/C catalyst over the other N-infused PtCo catalyst.

The following review comprehensively covers various ways of making catalysts.

Ultra‑low loading of platinum in proton exchange membrane‑based fuel cells: a brief review, Aristatil Ganesan and Mani Narayanasamy, Materials for Renewable and Sustainable Energy (2019) 8:18

Abstract
This review report summarizes different synthesis methods of PEM-based fuel cell catalysts with a focus on ultra-low loading of Pt catalysts. It also demonstrates fuel cell performances with ultra-low loading of Pt catalysts which have been reported so far, and suggests a combination method of synthesis for an efficient fuel cell performance at a low loading of Pt catalyst. Here, maximum mass-specific power density (MSPD) values are calculated from various reported performance values and are discussed, and compared with the Department of Energy (DOE) 2020 target values.

Introduction

・・・・・・・・・・

Regrettably, expensive platinum group metal (PGM) catalysts block the commercial sales/volume. PGM (plus application) costs contribute from 21% (1000 FC systems/year) to 45% (500,000 systems/year) of the total cost of the FC stack [5], as expected. Since PGMs are expensive, the PGM loading should be reduced from current (target) levels. As PGMs play a critical role in both hydrogen oxidation reaction (anode–HOR) and oxygen reduction reaction (cathode–ORR) of the fuel cell, the challenge is ahead in the PEMFC community to address PGM cost issues for its use in both anode and cathode of the fuel cell.

・・・・・・・・・・

According to DOE 2020, the loading target of PGM is 0.125 mgcm−2 and < 0.1 mgcm−2 for the anode and cathode, respectively. Nevertheless, a still lower loading of about 0.0625 mgcm−2 is required for PEMFC vehicles to stand along with IC engine vehicles.

Literature

Many research groups are working on Pt alloy catalysts such as PtCo; PtNi; PtCoMn; WSnMo; PtRu; PtAgRu; PtAuRu; PtRhRu; and Pt–Ru–W2C to replace Pt/C [7–10]. By providing high surface area carbon supports, Pt content could be reduced with high Pt utilization [11, 12]. Using the plasma sputtering technique [13], the total Pt loading in both anode and cathode is reduced to 20 μgcm−2. By this method, uniform dispersion of Pt as clusters with size less than 2 nm is achieved with high catalyst utilization.

Most researchers have made an attempt to reduce Pt loading by providing novel catalyst supports such as multiwalled CNT and single-walled CNT [14]. Binary alloys of Pt, Pt–Cu [15], Pt–Co [16–19], Pt–Ni [17, 18], Pt–Cr [17] revealed 2–3 times higher mass-specific activity than Pt/C, which is due to alloy effects and ligand effects. A ternary alloy of PtFeNi and PtFeCo [19] showed excellent ORR activity, but in some cases presents Pt particles aggregation. A bimetallic alloy of Pd–Pt on hollow core mesoporous shell carbon (PtPd/HCMSC) demonstrated enhanced ORR activity and stability [20, 21]. Recently, a core shell of PtCo@Pt offered low loading of catalyst, but it had a disadvantage of base metal cobalt (Co) leaching (dissolution) from bulk to surface [22]. Wang et al. [21] investigated PtNi alloy as a high-performing catalyst for automotive applications with a low loading of Pt: 0.125 mgcm−2 which satisfied the DOE 2020 target.

Pt–Ni alloy catalyst synthesized by direct current magnetron sputtering involves Pt sputtering on synthesized PtNi/C substrate which forms multilayered Pt-skin surface, with superior ORR activity. This catalyst involves the mature technology of synthesis with improved performance compared to Pt/C. Though this catalyst presents superior performance, it involves careful preparation of Pt target material for sputtering (costly), preparation of PtNi by chemical reduction, thermal decomposition, and acid treatment with final heat treatment. Materials’ preparation involves many steps and needs careful optimization for getting a reasonable yield of catalyst. Durability studies were not conducted at the MEA level as it is specified by DOE.

Kongkanand et al. investigated [22] PtCo on high surface area carbon (HSC), which demonstrates a lower degree of PtCo particle coalescence after the stability test. Also, HSC is favored for start-up performance and long-term durability. The dissolution of Pt and Co was resolved by developing a deposition model [23]. DOE has updated its cost estimation for an automotive fuel cell by 15%, i.e., $45/kW, because of the development of catalyst, PtCo/HSC. This catalyst system would reduce the total cost of the system to 14% or $7.5/kW [22]. These catalysts (PtNi/PtCo) cost about $15.20/g for cathode (Pt 0.100 mgcm−2) and $10.86/g for Anode (Pt 0.025 mgcm−2) [1]. Chen et al. investigated Pt3Ni nanoframes and demonstrated high mass activity with durability, but MEA performance at high current density was challenging [23]. The shape-controlled synthesis of Pt–Pd and Ru–Rh catalyst showed high mass activity and it offers a commercially efficient scale-up method. This catalyst has issues with performance at the MEA level and stability [24]. In addition to catalyst support modification, and alloying of Pt, for Pt content reduction, a proper MEA fabrication methodology is to be identified for low Pt loading. This review provides intensive guidance for researchers working on low Pt loading catalyst for fuel cells.

Most promising methods for the preparation of electrodes

Although several methods are available for catalyst synthesis and coating, such as physical vapor deposition, chemical vapor deposition, sputter deposition, galvanic replacement reaction (Pd nanocrystals with different shapes) [24], hydrothermal synthesis [25], electrodeposition (hetero-structured nanotube dual catalyst) [26], electrospinning [27], the molten salt method [28], and electrodeposition [29], only very few are practically feasible for producing catalyst nanoparticles and coating them effectively on electrodes.

Electrodeposition
The need and necessity for nanostructured energy materials with high surface area, and for its efficient application in energy conversion devices, can be achieved only with the electrochemical synthesis route. Electrodeposition technique proves to be the best method for the following reasons.

1. Electrode potential, deposition potentials, current densities, and bath concentrations could be controlled for the synthesis of homogenous nanostructured materials.

Hence by varying deposition parameters, one can synthesize thin catalyst film, with desired stoichiometry, thickness, and microstructure.

2. Particle size, desired surface morphology, catalyst loading, thickness, and microstructure can be easily achieved using various control parameters involved in electroplating.

3. Electrochemical reactions proceed at ambient temperature and pressure, as high thermodynamic efficiency during plating is maintained.

4. Environmentally friendly.

5. Synthesis can be started with low-cost chemicals as precursor materials.

6. One-pot single-step synthesis of the final product is possible by avoiding a number of steps.

7. Any metal or alloy can be easily doped into desired nanostructured materials.

8. The required nanostructured energy materials can be directly grown on the electrode surface by electrochemical method, and it provides good adhesion, large surface area, and electrical conductivity.

And hence, this method is found suitable for construction of energy devices with high efficiency and with low cost.

9. By this method, materials with poor electrical conductivity of metal oxide used as catalyst supports can be easily incorporated into advanced energy materials and will facilitate fast electron transport mechanism. Therefore, electrical conductivity of catalyst supports can be enhanced by the electrodeposition method.

10. The electrochemical synthesis route eliminates the complexity of mixing catalyst powders with carbon black and polymer binder in fabricating electrodes for fuel cells in a short time [29].
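
As a rough illustration of why deposition charge is such a direct handle on catalyst loading, Faraday's law of electrolysis, m = Q·M/(n·F), relates deposited mass to the charge passed. The sketch below is not taken from the review. It assumes 100% current efficiency and a four-electron reduction of Pt(IV) (for example from H2PtCl6), and the function names are made up for illustration.

```python
# Faraday's law of electrolysis: m = Q * M / (n * F)
FARADAY = 96485.332  # C/mol
M_PT = 195.084       # g/mol, molar mass of Pt


def pt_loading_ug_per_cm2(charge_c_per_cm2, n_electrons=4, efficiency=1.0):
    """Pt loading (ug/cm^2) deposited by a given charge density (C/cm^2).

    n_electrons=4 assumes Pt(IV) -> Pt(0); efficiency=1.0 is an idealization.
    """
    grams = efficiency * charge_c_per_cm2 * M_PT / (n_electrons * FARADAY)
    return grams * 1e6


def charge_for_loading(loading_ug_per_cm2, n_electrons=4, efficiency=1.0):
    """Charge density (C/cm^2) needed to reach a target Pt loading."""
    grams = loading_ug_per_cm2 * 1e-6
    return grams * n_electrons * FARADAY / (M_PT * efficiency)


q = charge_for_loading(125.0)
print(round(q, 3))  # about 0.247 C/cm^2 under the assumptions above
```

In practice side reactions such as hydrogen evolution lower the current efficiency, so the real charge needed would be higher than this ideal estimate.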

・・・・・・・・・・

Chemical precipitation method

A thin nanocatalyst layer is formed by reducing the precursor in solution with a reducing agent. The desired particle size of the catalyst can be achieved by varying parameters such as temperature, pH, the ratio of reducing ion to Pt, reaction time, and stirring rate. The main disadvantage of this method is that it produces irregular particle sizes and shapes, resulting in an inhomogeneous layer. This arises from varying growth kinetics and conditions, and thus it is the least used method for catalyst synthesis.

Colloidal method

In this method, a colloidal dispersion is formed from a stabilizer and the precursor. A suitable support material is added, upon which colloid deposition occurs on the support surface. In the final stage, decomposition of the colloid results in the formation of the catalyst. Colloidal particles are commonly formed from the precursors H2PtCl6 and RuCl3 and reduced with a reducing agent. The stabilizers and reducing agents present in the final product have to be removed by thermal treatment. This method involves several steps for the catalyst synthesis.

Sol–gel synthesis method

This method forms solid particles suspended in a liquid solution (sol), which upon subsequent aging and drying form a semi-solid suspension in a liquid (gel). Subsequent calcination results in a mesoporous solid or powder formed on the substrate. The pore size distribution of the catalyst layer can be varied through the experimental parameters. The disadvantage of this technique is catalytic burning in pores, which makes them inaccessible to reactants and results in low catalyst utilization.

Impregnation method

This method uses high surface area carbon supports for the formation of the catalyst. Here, a Pt chloride salt is mixed directly with a reducing agent (Na2S2O3, NaBH4, N2H4, formic acid, or H2 gas) in an aqueous solvent. This method results in Pt agglomeration and weak support due to the high surface tension of the liquid solution [56].

Microemulsion method

The water-soluble inorganic salt was used as a metal precursor in the solution. Here the particle growth rate, size, and shape are being decided by a proper proportion of metal
salt and organic solvent and the resulting solution forms water-in-oil structure (microemulsion). The hydrophobic property of organic molecules protects the metal particle as an insulation layer and prevents agglomeration when the reducing agent is added. That is a surfactant-assisted synthesis of catalyst which forms suitable catalyst support with the protection layer. The main drawback of this method is the use of expensive chemicals and not being environmentally friendly [39].

Microwave‑assisted polyol method

Here, Pt metal salts are reduced in ethylene glycol, and the reduction reaction occurs at a temperature above 120 °C. Microwave-assisted heating could produce more active ORR
catalyst than the conventional heat treatment. Microwave heating produces uniform dispersion and greater morphological control over particle size (< 3 nm). The main advantage of this method is that it has no surfactant addition and uses an inexpensive solvent like ethylene glycol. The disadvantage associated with this method is that it is time consuming.

Chemical vapor deposition (CVD)

This method supplies the required precursors in the gas phase, using external heat energy or plasma sources, in an enclosed media-assisted chamber. A thin solid film is formed on the substrate by the decomposition reaction of the precursors. The impurities produced during the reaction are removed by the media gas flowing into the chamber. This method is most widely used for the synthesis of advanced materials like CNT and graphene. It involves a huge cost for instruments and processing.

Spray technique

Spray painting involves printing techniques for coating catalyst directly on the substrate, and it involves inkjet printing, casting, sonic method, etc. The advantage of printing technique is that we can coat a large area of the electrode, irrespective of surface (conductive or non-conductive) of the substrate. After coating, the coated surface is allowed for evaporation of the solvent. Though many advantages are provided by this technique, it has a large influence on practical applicability and mass production, so catalyst utilization is very low.

Atomic layer deposition (ALD)

This method is a subclass of CVD. Here gas phase molecules are used sequentially to deposit atoms on the substrate. The precursors involved react on the surface one at a time, in sequential order. The substrate is exposed to different precursors at different times, and a uniform nanocoating forms on the substrate. This method involves four steps to complete the whole process: (1) exposure to the first precursor, (2) purging of the reaction chamber, (3) exposure to the second reactant precursor, and (4) a further purge of the reaction chamber. During steps 1 and 2, the precursor reacts with the substrate at all available reactive sites, and the unused precursor and impurities are removed by purging with inert gas. During the third step, the adsorbed precursor on the substrate reacts with the reactant precursor, which eliminates the ligands of the first precursor to form the target material, while the residues formed in step 3 are eliminated by the inert gas purge of step 4, which completes one cycle. Many such cycles are repeated to achieve the desired thickness of the target material.
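
The four-step cycle described above can be sketched as a simple loop. This is only a bookkeeping illustration, assuming that every cycle adds the same thickness; the growth-per-cycle value used in the example is hypothetical and the function names are invented here. Thicknesses are kept in integer picometres to avoid floating-point rounding in the cycle count.

```python
ALD_STEPS = (
    "expose to first precursor",
    "purge chamber",
    "expose to second (reactant) precursor",
    "purge chamber",
)


def cycles_needed(target_pm, growth_per_cycle_pm):
    """Smallest number of full cycles reaching the target thickness."""
    if growth_per_cycle_pm <= 0:
        raise ValueError("growth per cycle must be positive")
    return -(-target_pm // growth_per_cycle_pm)  # integer ceiling division


def run_ald(target_pm, growth_per_cycle_pm):
    """Return final thickness (pm) and the ordered list of (cycle, step)."""
    thickness = 0
    log = []
    for cycle in range(1, cycles_needed(target_pm, growth_per_cycle_pm) + 1):
        for step in ALD_STEPS:
            log.append((cycle, step))
        thickness += growth_per_cycle_pm  # assumed constant growth per cycle
    return thickness, log


# Hypothetical numbers: 2 nm film at 50 pm (0.05 nm) per cycle.
thickness, log = run_ald(2000, 50)
print(thickness, len(log) // len(ALD_STEPS))  # 2000 40
```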

Key features to consider when preparing the electrodes

In the emerging hydrogen economy, fuel cell technology developments need to be addressed in terms of cost effectiveness, benchmark performance as directed by the US DOE, and operation over long life cycles. There are many ways to reduce the cost of fuel cells without sacrificing performance, and they are listed below [45–50]:

1. Reduction of precious metal loading.
2. Nanostructured thin-film (NSTF) development for catalyst layer.
3. Particle size reduction for electrocatalyst.
4. Developing non-precious metal/alloy.
5. Developing novel catalyst preparation methods.
6. Using novel MEA fabrication methods to adopt for advanced catalyst and membrane materials.
7. Adopting new techniques to promote triple-phase boundaries and mitigate mass transfer limitation.
8. Attempt to develop carbonaceous and non-carbonaceous catalyst support materials to achieve peak performance at low-cost investment.

Despite the various useful applications of PEMFC, it still has a long way to go in terms of catalyst cost, efficiency, and cycle stability for successful commercialization. Even now, Pt and Pt-based materials hold a strong position as efficient catalysts for PEMFC and DMFC, as they exhibit superior catalytic activity, electrochemical stability, high exchange current density, and excellent work function [50–53].

Because Pt resources in the earth's crust are scarce, Pt is costly and in limited supply for industry. In regard to PEMFC automotive applications, the present resources of Pt are not sufficient to fulfill the requirements, and the obtained ORR activity is also not up to the benchmark performance [51]. For these reasons, researchers now focus mainly on synthesizing ultra-fine Pt nanoparticles, alloying with other metals, and ultra-low loading of Pt on highly porous, high surface area metal oxide/composite supports to reduce the cost without sacrificing performance [52]. Usually, conductive porous membranes are used as catalyst support materials for PEMFC and DMFC, but the use of a metal catalyst support shows higher stability and activity when compared to an unsupported catalyst.

The typical characteristics of catalyst support are as follows:
・High surface area.
・Ability to maximize triple-phase boundary through their mesoporous structure.
・Good metal–catalyst support interaction.
・High electrical conductivity.
・Good water management.
・Increased resistance to corrosion.
・Ease of catalyst recovery [54].

Support material, in addition to increasing catalytic activity and durability, also determines the particle size of a metal catalyst. Hence, the support material should be chosen in such a way that it supports performance, behavior, long cycles of operation, and a cheaper catalyst. The following steps should be considered for developing a new catalyst system:

• Developing non-precious metal catalyst.
• Choice of suitable catalyst support materials.

The platinum group metals other than Pt are palladium, ruthenium, rhodium, iridium, and osmium. The availability of these metals is scarce compared to Pt. Hence by incorporating all the above points and alloying with non-Pt group metals, the loading of precious metal could be reduced with higher performance [55]. The essential properties of support materials discussed above are important to achieve better fuel cell performance at a lower cost.

Stability

The major issue with PEMFC catalyst is long-term durability. During the continuous operation of PEMFC, catalytic agglomeration, and electrochemical corrosion of carbon-based support result in deterioration of catalyst activity [53]. By choosing the correct catalyst support, one can eliminate the agglomeration of catalyst, and corrosion of support. With the existing carbon black support, the electrochemical corrosion triggers at above 0.9 V which results in the catalyst getting detached from support, and agglomerates. It will create a lack of diffusion of fuel/oxidant reactants and reduces overall fuel cell performance, and life. These issues force us to find a solution for long cyclic stability of PEMFC by choosing proper support, which has strong electrochemical stability under acid/alkaline medium.

The most widely used support materials are carbon blacks, available in various grades from various companies that differ in porosity and surface area. Over the last decade, researchers have focused on nanostructured catalyst supports, as they deliver faster charge transfer, higher surface area, and improved catalytic activity. They are broadly classified into carbonaceous and non-carbonaceous supports. The carbonaceous type includes different modified carbon materials such as mesoporous carbon, carbon nanotubes (CNTs), nanodiamonds, carbon nanofibers (CNF), and graphene [36, 54–61]. These nanostructured modified carbons offer high surface area, high electrical conductivity, and good stability in acid and alkaline environments. The high crystallinity of carbon nanomaterials such as CNT and CNF provides stability and good activity [62]. However, under repeated cycles of fuel cell operation, carbon materials such as carbon black face serious corrosion problems. Although the corrosion rate decreases considerably with more graphitic carbon materials such as carbon nanotubes and carbon nanofibers, they do not prevent carbon oxidation [63]. To achieve high corrosion/oxidation resistance, stability, and durability, metal oxides are preferred as catalyst support materials instead of carbon [52, 64]. Metal oxides offer [62, 64]:

high electrochemical stability, mechanical stability, porosity, high surface area, cycling stability and durability [62, 63].

Debe et al. derived development criteria for automotive fuel cell electrocatalysts as given in Tables 1 and 2. They proposed that increased surface area of catalyst will improve the activity of the outer Pt layer [65]. Nanostructured thin-film (NSTF) catalysts will give high surface area for efficient activity for the catalyst. NSTF electrocatalysts offer area-specific activity of the catalyst, catalyst utilization, stability, and performance with ultra-low PGM loadings.

Problems associated with ultra‑low loading

During continuous operation of fuel cell, there will be a loss in ECSA due to dissolution, agglomeration, and Ostwald ripening. So, catalyst stability and durability are being
decided by ECSA loss before and after operation of specified hours. Most recent catalyst systems with ultra-low loading present very high mass activity (30 × higher mass activity
vs. Pt/C), but they fail at high current density targets. For example, core–shell (Pt@Pd/C) catalysts exhibit higher mass activity but undergo some degree of base metal dissolution [71]. So, new catalyst development with the focus on ultra-low loading of precious metal and stability at high current densities (HCD) is required even though they exhibit higher mass activity.

Requirements of cathode catalysts

PGM alloy shows high performance at the beginning but suffers higher ohmic/mass transport losses during continuous operation. During long cycling, conventional Pt/C loses performance through degradation (dissolution, agglomeration, and Ostwald ripening), and PGM alloy contaminates the ionomer through the dissolution of ions, resulting in additional performance loss at high current densities. Hence, a novel cathode catalyst layer is required for high performance and durability. As pointed out earlier, most Pt alloy catalysts with high mass activity show high performance at low current densities, but suffer from performance loss at high current densities due to base metal or support dissolution, and this loss is progressive under voltage cycling. Hence, a novel cathode catalyst layer design is proposed to overcome the above-discussed problems and to deliver stable performance and durability. Dustin Banham et al. [72] present real-world requirements for the design of PEMFC catalysts.

Requirements for PEMFC anode catalyst

Platinum is a superior catalyst for the hydrogen oxidation reaction at the anode of the fuel cell, and it accounts for 50% of the fuel cell cost [72]. During stack operation, if the flow field on the anode side is blocked, the current forces the cell and stack to malfunction. Materials present in the anode layer, such as carbon, catalyst, and water, are oxidized to supply the necessary electrons. This, in turn, leads to a high anodic potential (> 1.5 V) and the deterioration of the anode catalyst layer. This implies the need for a novel catalyst layer with a strong support material that has electrochemical stability and durability. Nowadays, catalyst research groups must have a strategy to test their catalysts for fuel cell performance and durability at the MEA level. This will further require real-time stack testing and optimization of various parameters, taking into account the interdependency of the various materials involved in the system.

Maximum mass‑specific power density (MSPD)

DOE has targeted maximum mass-specific power density (MSPD) values [73], which account for both low Pt anode and low Pt cathode catalysts, as an index for performance
with reference to Pt loading. DOE targets more than 5 mW μg−1 Pt total at cell voltages higher than 0.65 V [Department of Energy (DOE)]. This cost reduction to meet the DOE 2020 target is possible if we could reduce Pt loading in MEAs to less than 125 μg cm−2 MEA. In general, it is classified into three regions: (1) > 5 mW μg−1 Pt total (2) between 1 and
5 mW μg−1 Pt total (3) < 1 mW μg−1 Pt total. The maximum MSPD value 8.76 mW μg−1 Pt total at 0.65 V is obtained by a proprietary catalyst, PtNi/PtCo, of General Motors and United Technologies Research Center (UTRC), and stack modeling performed by ANL [23] (Fig. 3).
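
To make the MSPD index concrete, the sketch below divides the areal power density at a given cell voltage by the total (anode plus cathode) Pt loading and sorts the result into the three regions quoted above. The function names and the example cell (1.0 A cm−2 at 0.65 V with 25 + 100 μg cm−2 Pt) are hypothetical and chosen only to show the arithmetic.

```python
def mspd_mw_per_ug(current_density_a_cm2, cell_voltage_v,
                   anode_ug_cm2, cathode_ug_cm2):
    """Mass-specific power density in mW per ug of total Pt."""
    power_mw_cm2 = current_density_a_cm2 * cell_voltage_v * 1000.0
    return power_mw_cm2 / (anode_ug_cm2 + cathode_ug_cm2)


def mspd_region(mspd):
    """Region 1: > 5, region 2: between 1 and 5, region 3: < 1 (mW/ug Pt)."""
    if mspd > 5.0:
        return 1
    if mspd >= 1.0:
        return 2
    return 3


def required_power_mw_cm2(total_loading_ug_cm2, target_mspd=5.0):
    """Power density needed to reach a target MSPD at a given Pt loading."""
    return target_mspd * total_loading_ug_cm2


value = mspd_mw_per_ug(1.0, 0.65, anode_ug_cm2=25.0, cathode_ug_cm2=100.0)
print(round(value, 2), mspd_region(value))  # 5.2 1
print(required_power_mw_cm2(125.0))         # 625.0
```

With a total loading of 125 μg cm−2, the 5 mW μg−1 threshold corresponds to 625 mW cm−2, which is why lowering the loading and holding power at 0.65 V are two sides of the same target.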

f:id:AI_ML_DL:20210815214707p:plain

Catalyst synthesis and deposition methods: MSPD values

Various catalyst synthesis methods are listed in Tables 4 and 5 with a primary focus on how an ultra-low loading of catalyst impacts the fuel cell performance by the influence
of maximum mass-specific power density (MSPD) values. Each method has achieved maximum performance with low loading of catalyst within the boundary of its limitation.

Combination method of synthesis and coating

By comparing all synthesis methods (Fig. 4), it is found that the combination method of synthesis and coating (e.g., spraying and sputtering) has achieved higher MSPD values
than any single method of synthesis. It is also encouraging to note that the combination method of synthesis and coating may eliminate the limitation posed by a specific method. In this review, for example, electrodeposition and plasma sputtering/spraying synthesis methods are recommended for developing an efficient catalyst system which would deliver good performance and stability at high current density with long-term durability. Here the disadvantages posed by each method are overcome by the other. Any catalyst synthesis and coating technique, which is being scaled up with
high performance/durable catalyst layer, is now a superior priority. Hence, greater attention should be paid not only towards the alloy catalyst but also the catalyst preparation methods, and choice of catalyst support materials [64]. Table 4 shows various catalyst synthesis methods and respective MSPD values along with reference. Table 5 shows various synthesis methods and their merits and demerits.

Conclusion

Here a brief review of various catalyst synthesis methods and their efficacies is performed with a focus on ultra-low loading of catalyst. Also, the merits and demerits of various
synthesis methods are discussed. The ultra-low loading in electrodes was discussed in terms of MSPD values, and is compared with DOE 2020 target values. The catalyst
prepared by any combination of the method of synthesis which results in MSPD values more than 5 mW μg−1 Pt total at > 0.65 V will be the best catalyst to meet the target
of DOE 2020.

The literature on applications of machine learning is growing by the day.

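Machine learning is used for many purposes: for each component, searching for new materials, predicting performance from basic data, optimizing part structures, evaluating performance, and predicting performance degradation, as well as optimizing the fuel cell as a system and analyzing its degradation. I have come to realize that this is not something that can be covered at the level of a blog post. In analytical techniques as well, results should be easy to compute with neural networks that incorporate first-principles calculations and molecular dynamics, and those theoretical results need to be made usable for clarifying physicochemical causes, from the analysis of measurement data through to performance evaluation and degradation analysis. For the analysis of individual measurement results, machine learning should also help improve reproducibility and shorten analysis time through automation, but that will probably have to be worked out while looking at actual data. Making the interpretation of analytical results more advanced or automated requires detailed data that are not included in ordinary reports, so before requesting an analysis I need to check whether the necessary data will be disclosed, sort out disclosure of know-how, and, since nothing is usable unless it is all electronic, decide on data formats in advance. Let's clear these one by one. There is honestly more that I do not understand than that I do, so it seems necessary to frame the problems correctly and tackle them starting from the most essential ones.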

August 20 (Fri)

Let's look at an example of applying machine learning to XPS analysis.

Deep neural network for x-ray photoelectron spectroscopy data analysis
G. Drera, C. M. Kropf and L. Sangaletti, Mach. Learn.: Sci. Technol. 1 (2020) 015008

Abstract
In this work, we characterize the performance of a deep convolutional neural network designed to detect and quantify chemical elements in experimental x-ray photoelectron spectroscopy data.

So if a measured X-ray photoelectron spectrum is fed into the CNN, you get quantification results and chemical-state analysis results, I suppose.

Given the lack of a reliable database in literature, in order to train the neural network we computed a large (<100 k) dataset of synthetic spectra, based on randomly generated materials covered with a layer of adventitious carbon.

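Because the literature offers few reliable databases, they appear to have generated close to 100 k spectra by simulation, using randomly chosen compositions and assuming samples with adventitious carbon on the surface.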
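
To get a feel for what "synthetic spectra of random materials under adventitious carbon" means as training data, here is a deliberately crude toy generator. It is not the authors' simulation: peaks are plain Gaussians at random positions, the carbon overlayer is reduced to one extra peak near the commonly used C 1s reference position of 284.8 eV, and every parameter name and default value is an assumption for illustration.

```python
import math
import random

C1S_EV = 284.8  # commonly used C 1s reference for adventitious carbon


def synthetic_spectrum(n_peaks=3, n_points=1000, carbon_fraction=0.3,
                       noise=0.01, seed=None):
    """Return (energies, counts, peaks) for one toy spectrum, 0 to 1200 eV.

    peaks is a list of (position_eV, amplitude); amplitudes sum to 1 and
    play the role of the training label.
    """
    rng = random.Random(seed)
    step = 1200.0 / (n_points - 1)
    energies = [i * step for i in range(n_points)]
    # Random "material": random peak positions and normalized weights.
    raw = [rng.random() + 1e-9 for _ in range(n_peaks)]
    scale = (1.0 - carbon_fraction) / sum(raw)
    peaks = [(rng.uniform(50.0, 1150.0), w * scale) for w in raw]
    peaks.append((C1S_EV, carbon_fraction))  # stand-in for the carbon layer
    counts = []
    for e in energies:
        value = sum(a * math.exp(-0.5 * ((e - c) / 2.0) ** 2)
                    for c, a in peaks)
        counts.append(value + rng.gauss(0.0, noise))
    return energies, counts, peaks


# A small "training set" of (spectrum, label) pairs.
dataset = [synthetic_spectrum(seed=s)[1:] for s in range(5)]
print(len(dataset), len(dataset[0][0]))  # 5 1000
```

A realistic generator would also need element-specific line positions, cross sections, inelastic background, and attenuation by the overlayer, which is presumably where most of the work in the paper lies.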

The trained net performs as well as standard methods on a test set of ≈500 well characterized experimental x-ray photoelectron spectra.

The trained network reportedly performed on par with standard methods (of data analysis) on about 500 measured spectra.

Fine details about the net layout, the choice of the loss function and the quality assessment strategies are presented and discussed.

They apparently describe the details of the CNN, the choice of loss function, and the performance evaluation results.

Given the synthetic nature of the training set, this approach could be applied to the automatization of any photoelectron spectroscopy system, without the need of experimental reference spectra and with a low computational effort.

By using simulated spectra, it appears possible to output analysis results for experimental spectra automatically, without experimentally acquired reference spectra.

Overall flow:

f:id:AI_ML_DL:20210825223939p:plain

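The structure of the DNN is shown in the next figure. The spectrum is taken in (input) as a one-dimensional data series rather than as an image. The thickness of the adventitious carbon and the quantification of the 81 elements are evaluated separately. That this method gave fairly good results suggests that the technique used to generate the simulated spectra is of a high standard. If the physics used in that simulation were also built into the neural network, I think even more accurate analysis results could be obtained.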
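
Treating a spectrum as a 1D data series means the convolution kernels slide along the energy axis only. The minimal sketch below shows one such 1D convolution followed by a ReLU. It is not the layout from the paper: the kernel here is hand-picked to respond to a narrow peak, whereas a trained CNN would learn its kernels from data.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation, as used in CNN convolutional layers."""
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n_out)
    ]


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


# Toy "spectrum" with a single peak, and a kernel that highlights peaks.
spectrum = [0, 0, 1, 3, 1, 0, 0]
peak_kernel = [-1, 2, -1]

features = relu(conv1d(spectrum, peak_kernel))
print(features)  # [0.0, 0.0, 4, 0.0, 0.0]
```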

f:id:AI_ML_DL:20210825221926p:plain

Examples of simulated spectra:

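They look natural and seem very well done. Looking at the distribution of relative intensities, the values differ by a factor of about 100 between elements even excluding Li and B, so quantifying elements from a survey spectrum is a stretch, and in some cases I suspect the results can only serve as a rough guide.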
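
The effect of a 100-fold difference in relative intensity can be seen with the usual relative-sensitivity-factor estimate, where the atomic fraction of element i is (I_i / S_i) divided by the sum of I_j / S_j over all elements. The element names and numbers below are placeholders, not real sensitivity factors.

```python
def atomic_percent(peak_areas, sensitivity_factors):
    """Atomic % from peak areas I and relative sensitivity factors S."""
    corrected = {
        element: peak_areas[element] / sensitivity_factors[element]
        for element in peak_areas
    }
    total = sum(corrected.values())
    return {element: 100.0 * v / total for element, v in corrected.items()}


# Two hypothetical elements with equal peak areas but a 100x difference
# in sensitivity: the weakly emitting element X dominates the composition.
areas = {"X": 1000.0, "Y": 1000.0}
factors = {"X": 1.0, "Y": 100.0}

result = atomic_percent(areas, factors)
print({k: round(v, 2) for k, v in result.items()})  # {'X': 99.01, 'Y': 0.99}
```

Read the other way round, a low-sensitivity element at a few atomic percent gives a peak barely above the noise of a survey scan, which is one reason its quantification there is only a rough guide.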

f:id:AI_ML_DL:20210826110508p:plain

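Comparison of manual analysis of measured spectra with the DNN results: the network predicts which element each peak belongs to and even calculates the concentration of each element, producing reasonable values, which I find impressive in some respects. But thinking about actually doing analysis with XPS, my reaction to these results is: this is great, so I hope that narrow scans will be added to raise the quantification accuracy and that chemical-state analysis will also become possible (or perhaps I should try it myself).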

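In the CN/Si analysis result at the top, the DNN detects oxygen whereas the experimental result contains no oxygen, which I cannot understand. An oxygen peak is clearly visible in the measured spectrum. The paper itself gives no explanation for this oxygen discrepancy.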

f:id:AI_ML_DL:20210826125246p:plain

4. Conclusions
In conclusion, we have shown the application of a neural network to the identification and quantification task of XPS data on the basis of a synthetic random training set.

Results are encouraging, showing a detection and an accuracy comparable with standard XPS users, supporting both the training set generation algorithm and the DNN layout.

This approach can easily be scaled to different photon energies, energy resolution and data range; furthermore, the DNN could be trained to provide more output values, such as the actual chemical shifts for each element, expanding the net sensitivity towards the chemical bonds classification.

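It would be good if this could go from chemical-state analysis (chemical bonds classification) all the way to peak separation (peak fitting). Peak separation would need its own training data, so that is a separate matter, but I feel the same approach could get some of the way there.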

In any case, what matters is accumulating simulated spectra produced with high-accuracy spectrum simulation techniques.

August 26 (Thu)

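Next up is TEM, I guess: the number of references is 1654. I have never seen this many. The paper is 73 pages in total including the cover, and the references start on page 35.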

Deep learning in electron microscopy
Jeffrey M Ede, Mach. Learn.: Sci. Technol. 2 (2021) 011004

Abstract
Deep learning is transforming most areas of science and technology, including electron
microscopy. This review paper offers a practical perspective aimed at developers with limited familiarity. For context, we review popular applications of deep learning in electron microscopy. Following, we discuss hardware and software needed to get started with deep learning and interface with electron microscopes. We then review neural network components, popular architectures, and their optimization. Finally, we discuss future directions of deep learning in electron microscopy.

1. Introduction
Following decades of exponential increases in computational capability [1] and widespread data availability [2, 3], scientists can routinely develop artificial neural networks [4–11] (ANNs) to enable new science and technology [12–17].

1.1. Improving signal-to-noise
A popular application of deep learning is to improve signal-to-noise [74, 75], for example, of medical electrical [76, 77], medical image [78–80], optical microscopy [81–84], and speech [85–88] signals.

1.2. Compressed sensing
Compressed sensing [203–207] is the efficient reconstruction of a signal from a subset of measurements. Applications include faster medical imaging [208–210], image compression [211, 212], increasing image resolution [213, 214], lower medical radiation exposure [215–217], and low-light vision [218, 219]. In STEM, compressed sensing has enabled electron beam exposure and scan time to be decreased by 10–100× with minimal information loss [201, 202].

1.3. Labelling
Deep learning has been the basis of state-of-the-art classification [270–273] since convolutional neural networks (CNNs) enabled a breakthrough in classification accuracy on ImageNet [71].

1.4. Semantic segmentation
Semantic segmentation is the classification of pixels into discrete categories. In electron microscopy, applications include the automatic identification of local features [288, 289], such as defects [290, 291], dopants [292], material phases [293], material structures [294, 295], dynamic surface phenomena [296], and chemical phases in nanoparticles [297].

1.5. Exit wavefunction reconstruction
Electrons exhibit wave-particle duality [350, 351], so electron propagation is often described by wave optics [352]. Applications of electron wavefunctions exiting materials [353] include determining projected potentials and corresponding crystal structure information [354, 355], information storage, point spread function deconvolution, improving contrast, aberration correction [356], thickness measurement [357], and
electric and magnetic structure determination [358, 359].

2. Resources
Access to scientific resources is essential to scientific enterprise [378]. Fortunately, most resources needed to get started with machine learning are freely available.

2.1. Hardware acceleration
A DNN is an ANN with multiple layers that perform a sequence of tensor operations. Tensors can either be computed on central processing units (CPUs) or hardware accelerators [62], such as FPGAs [382–385], GPUs [386–388], and TPUs [389–391]. Most benchmarks indicate that GPUs and TPUs outperform CPUs for typical DNNs that could be used for image processing [392–396] in electron microscopy.

2.2. Deep learning frameworks
A DLF [9, 458–464] is an interface, library or tool for DNN development. Features often include automatic differentiation [465], heterogeneous computing, pretrained models, and efficient computing [466] with CUDA [467–469], cuDNN [415, 470], OpenMP [471, 472], or similar libraries.

2.3. Pretrained models
Training ANNs is often time-consuming and computationally expensive [403]. Fortunately, pretrained models are available from a range of open access collections [505], such as Model Zoo [506], Open Neural Network Exchange [507–510] (ONNX) Model Zoo [511], TensorFlow Hub [512, 513], and TensorFlow Model Garden [514].

2.4. Datasets
Randomly initialized ANNs [537] must be trained, validated, and tested with large, carefully partitioned datasets to ensure that they are robust to general use [538].

2.5. Source code
Software is part of our cultural, industrial, and scientific heritage [612]. Source code should therefore be archived where possible. For example, on an open source code platform such as Apache Allura [613], AWS CodeCommit [614], Beanstalk [615], BitBucket [616], GitHub [617], GitLab [618], Gogs [619], Google Cloud Source Repositories [620], Launchpad [621], Phabricator [622], Savannah [623] or SourceForge [624].

2.6. Finding information
Most web traffic [636, 637] goes to large-scale web search engines [638–642] such as Bing, DuckDuckGo, Google, and Yahoo. This includes searches for scholarly content [643–645]. We recommend Google for electron microscopy queries as it appears to yield the best results for general [646–648], scholarly [644, 645] and other [649] queries.

2.7. Scientific publishing
The number of articles published per year in reputable peer-reviewed [693–697] scientific journals [698, 699] has roughly doubled every nine years since the beginning of modern science [700].

3. Electron microscopy
An electron microscope is an instrument that uses electrons as a source of illumination to enable the study of small objects. Electron microscopy competes with a large range of alternative techniques for material analysis [732–734], including atomic force microscopy [735–737]; Fourier transformed infrared spectroscopy [738, 739]; nuclear magnetic resonance [740–743]; Raman spectroscopy [744–750]; and x-ray diffraction (XRD) [751, 752], dispersion [753], fluorescence [754, 755], and photoelectron spectroscopy [756, 757].

3.1. Microscopes
There are a variety of electron microscopes that use different illumination mechanisms. For example, reflection electron microscopy (REM) [759, 760], SEM [761, 762], STEM [763, 764], scanning tunnelling microscopy [765, 766] (STM), and TEM [767–769].

3.2. Contrast simulation
The propagation of electron wavefunctions through electron microscopes can be described by wave optics [136]. Following, the most popular approach to modelling measurement contrast is multislice simulation [853, 854], where an electron wavefunction is iteratively perturbed as it travels through a model of a specimen.

3.3. Automation
Most modern electron microscopes support Gatan Microscopy Suite (GMS) Software [894]. GMS enables electron microscopes to be programmed by DigitalMicrograph Scripting, a proprietary Gatan programming language akin to a simplified version of C++.

4. Components
Most modern ANNs are configured from a variety of DLF components. To take advantage of hardware accelerators [62], most ANNs are implemented as sequences of parallelizable layers of tensor operations [914]. Layers are often parallelized across data and may be parallelized across other dimensions [915]. This section introduces popular non-linear activation functions, normalization layers, convolutional layers, and skip connections. To add insight, we provide comparative discussion and address some common causes of confusion.

5. Architecture
There is a high variety of ANN architectures [4–7] that are trained to minimize losses for a range of applications. Many of the most popular ANNs are also the simplest, and information about them is readily available. For example, encoder-decoder [305–308, 502–504] or classifier [272] ANNs usually consist of single feedforward sequences of layers that map inputs to outputs. This section introduces more advanced ANNs used in electron microscopy, including actor-critics, GANs, RNNs, and variational autoencoders
(VAEs). These ANNs share weights between layers or consist of multiple subnetworks. Other notable architectures include recursive CNNs [1078, 1079], network-in-networks [1141], and transformers [1142, 1143]. Although they will not be detailed in this review, their references may be good starting points for research.

6. Optimization
Training, testing, deployment and maintenance of machine learning systems is often time-consuming and expensive [1287–1290]. The first step is usually preparing training data and setting up data pipelines for ANN training and evaluation. Typically, ANN parameters are randomly initialized for optimization by gradient descent, possibly as part of an automatic machine learning (autoML) algorithm. RL is a special optimization case where the loss is a discounted future reward. During training, ANN components are often
regularized to stabilize training, accelerate convergence, or improve performance. Finally, trained models can be streamlined for efficient deployment. This section introduces each step. We find that electron microscopists can be apprehensive about robustness and interpretability of ANNs, so we also provide subsections on model evaluation and interpretation.
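
The remark that RL optimizes a discounted future reward can be made concrete in a few lines. This is a generic sketch rather than anything specific to the review: the discount factor gamma and the idea of using the negative return as a loss are standard conventions assumed here.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t, accumulated backwards from the last step."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def rl_loss(rewards, gamma=0.99):
    """Minimizing this loss maximizes the discounted return."""
    return -discounted_return(rewards, gamma)


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
print(rl_loss([1.0, 1.0, 1.0], gamma=0.5))            # -1.75
```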

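This review covers a very wide range of information on electron microscopy and deep learning, so not only the main text but also the references are a very important source of information.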

This may not be directly related to TEM, but I found a paper with a detailed explanation of super-resolution, which I have been curious about for some time, so I will read it.

On the use of deep learning for computational imaging
G. BARBASTATHIS, A. OZCAN AND G. SITU, Vol. 6, No. 8 / August 2019 / Optica

Since their inception in the 1930–1960s, the research disciplines of computational imaging and machine learning have followed parallel tracks and, during the last two decades, experienced explosive growth drawing on similar progress in mathematical optimization and computing hardware.

While these developments have always been to the benefit of image interpretation and machine vision, only recently has it become evident that machine learning architectures, and deep neural networks in particular, can be effective for computational image formation, aside from interpretation.

The deep learning approach has proven to be especially attractive when the measurement is noisy and the measurement operator ill posed or uncertain.

Examples reviewed here are: super-resolution; lensless retrieval of phase and complex amplitude from intensity; photon-limited scenes, including ghost imaging; and imaging through scatter.

In this paper, we cast these works in a common framework.

We relate the deep-learning-inspired solutions to the original computational imaging formulation and use the relationship to derive design insights, principles, and caveats of more general applicability.

We also explore how the machine learning process is aided by the physics of imaging when ill posedness and uncertainties become particularly severe.

It is hoped that the present unifying exposition will stimulate further progress in this promising field of research.

1. INTRODUCTION
Computational imaging (CI) is a class of imaging systems that, starting from an imperfect physical measurement and prior knowledge about the class of objects or scenes being imaged, deliver estimates of a specific object or scene presented to the imaging system [1–7]. This is shown schematically in Fig. 1.

f:id:AI_ML_DL:20210826225543p:plain

The specific architecture of interest here is based on the neural network (NN), a multilayered computational geometry. Each layer is composed of simple nonlinear processing units, also referred to as activation units (or elements); and each unit receives its inputs as weighted sums from the previous layer (except the very first layer, whose inputs are the quantities we wish the NN to process.) Until about two decades ago, students were advised to design NNs with up to three layers: the input layer, the hidden layer, and the output layer. Recent progress in ML has demonstrated the superiority of architectures with many more than three layers, referred to as deep NNs (DNNs) [14–17]. Figure 2 is a simplified schematic diagram of the multi-layered DNN architecture.

f:id:AI_ML_DL:20210827064208p:plain

During the past few years, a number of researchers have shown convincingly that the ML formulation is not only computationally efficient, but it also yields high-quality solutions in several CI problems. In this approach, shown in Fig. 3, the raw intensity image is fed into a computational engine specifically incorporating ML components, i.e., multilayered structures as in Fig. 2 and trained from examples, taking the place of the generic computational engine in Fig. 1. CI problems so solved have included lensless imaging, imaging through scatter, bandwidth- or sampling-limited imaging (also referred to as “super-resolution”), and extremely noisy imaging, e.g., under the constraint of very low photon counts.

f:id:AI_ML_DL:20210827073039p:plain

2. OVERVIEW OF COMPUTATIONAL IMAGING
A. General Formulation
Referring to Fig. 1, let f denote the object or scene that the imaging system’s user wishes to retrieve. To avoid complications that are beyond the scope of this review, we will assume that even though objects are generally continuous, a discrete representation suffices [31–33]. Therefore, f is a vector or matrix matching the spatial dimension where the object is sampled. Light–object interaction is denoted by the illumination operator Hi, whereas the collection operator Hc models propagation through the rest of the optical system.

The output of the collection optics is optical intensity g, sampled and digitized at the output (camera) plane. After aggregating the illumination and collection models into the forward operator H = HcHi, the noiseless measurement model is Eq. (1): g = Hf. Since the measurements are by necessity discrete, g is arranged into a matrix of the appropriate dimension or rastered into a one-dimensional vector. For a single raw intensity image, g may be up to two dimensional; however, if scanning is involved (as, e.g., in computed tomography where multiple projections are obtained with the object rotated at various angles), then g must be augmented accordingly.

Uncertainty in the measurements and/or the forward operator is the main challenge in inverse problems. Typically, an optical measurement is subject to signal-dependent Poisson statistics due to the random arrival of signal photons, and additive signal-independent statistics due to thermal electrons in the detector circuitry. Thus, the deterministic model (1) should be replaced by Eq. (2): g = P{Hf} + T. Here, P generates a Poisson random process with arrival rate equal to its argument; and T is the thermal random process often modeled as additive white Gaussian noise (AWGN). In realistic sensors, noise may originate from multiple causes, such as environmental disturbances. For large photon counts, signal quantization is also modeled as AWGN.
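To make the noise model concrete for myself, here is a minimal NumPy sketch of a measurement that combines a Poisson term driven by Hf with additive Gaussian thermal noise. The function name `simulate_measurement`, the random matrix standing in for H, the object values, and the noise level `sigma_thermal` are all my own assumptions for illustration; none of them come from the paper.

```python
import numpy as np

def simulate_measurement(H, f, sigma_thermal=2.0, rng=None):
    """Return one noisy measurement g = Poisson(H f) + Gaussian thermal noise."""
    rng = np.random.default_rng() if rng is None else rng
    rate = H @ f                      # noiseless model g = Hf (expected photon counts)
    photons = rng.poisson(rate)       # signal-dependent Poisson term P{Hf}
    thermal = rng.normal(0.0, sigma_thermal, size=rate.shape)  # AWGN term T
    return photons + thermal

rng = np.random.default_rng(0)
H = np.abs(rng.normal(size=(6, 4)))         # hypothetical nonnegative forward operator
f = np.array([50.0, 100.0, 10.0, 200.0])    # hypothetical object (photon rates)
g = simulate_measurement(H, f, rng=rng)
print(g)
```

Setting `sigma_thermal=0.0` leaves only the Poisson term, which is a quick way to see how much of the fluctuation is signal dependent.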

B. Linear Inverse Problems, Regularization, and Sparsity
For linear forward operators H, the image is obtained by minimizing the Tikhonov [3,4] functional

f:id:AI_ML_DL:20210827104022p:plain

where || · ||2 denotes the L2 norm. The first term expresses fitness, i.e., matching in the least-squares sense the measurement to the forward model for the assumed object. The fitness term is constructed for AWGN errors, even though it is often used with more general noise models (2). The regularization parameter α expresses our relative belief in the measurement fitness versus our prior knowledge. Setting α = 0 to obtain the image from the fitness term yields only the pseudo-inverse solution, or its Moore–Penrose improvement [59,60]. The results are often prone to artifacts and seldom satisfactory, due to ill posedness in the forward operator H. To improve, the second regularizing term Φ(f) is meant to compete with the fitness term, by driving the estimate fˆ to also match prior knowledge about the class of objects being imaged.
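As a way of checking my reading of the regularized functional, the sketch below solves the simplest special case in closed form. It assumes the prior term is Φ(f) = ||f||_2^2, which is my own choice for illustration (the paper leaves Φ general), and the blur matrix, noise level, and function name `tikhonov_solve` are likewise hypothetical.

```python
import numpy as np

def tikhonov_solve(H, g, alpha):
    """Minimize ||H f - g||_2^2 + alpha * ||f||_2^2 in closed form.

    Setting the gradient to zero gives (H^T H + alpha I) f = H^T g.
    """
    n = H.shape[1]
    return np.linalg.solve(H.T @ H + alpha * np.eye(n), H.T @ g)

rng = np.random.default_rng(0)
n = 32
x = np.arange(n)
# Hypothetical ill-conditioned forward operator: a Gaussian blur matrix.
H = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 2.0) ** 2)
f_true = np.zeros(n)
f_true[10:20] = 1.0
g = H @ f_true + 0.01 * rng.normal(size=n)   # blurred, slightly noisy measurement

f_small_alpha = tikhonov_solve(H, g, alpha=1e-8)  # almost pure data fitting
f_reg = tikhonov_solve(H, g, alpha=1e-2)          # stronger prior
print(np.linalg.norm(f_small_alpha), np.linalg.norm(f_reg))
```

With this particular prior, increasing α can only shrink the norm of the estimate, which is one way to see the regularizer competing with the fitness term.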

3. OVERVIEW OF NEURAL NETWORKS
A. Neural Network Fundamentals

Classification tasks generally produce representations of much lower dimension than that of the input images; therefore, the width decreases progressively toward the output, following the contracting architecture in Fig. 4(a).

Up-sampling tasks, as in the image super-resolution examples that we discuss in Section 4.A, require output dimension larger than the input, so expanding architectures such as Fig. 4(b) may be considered.

The concatenation of the two is the encoder–decoder architecture in Fig. 4(c). The unit widths progressively decrease, forming a compressed (encoded) representation of the input near the waist of the structure, and then progressively increase again to produce the final reconstructed (decoded) image. In the encoder–decoder structure, skip connections are also used to transfer information directly between layers of the same width, bypassing the encoded channels.

f:id:AI_ML_DL:20210827113430p:plain

B. Training and Testing Neural Networks
The power of NNs to perform demanding computational tasks is drawn from the complex connectivity between very simple activation units. The training process determines the connectivity from examples, and can be supervised or unsupervised. The supervised mode has generally been used for CI tasks, though unsupervised training has also been proposed [111–113]. After training, performance is evaluated from test examples that were never presented during training.

The supervised training mode requires examples of inputs u and the corresponding precisely known outputs v˜. In practice, one starts from a database of available examples and splits them to training examples, validation examples, and test examples. The training examples are used to specify the network weights; the validation examples are used to determine when to stop training; and the test examples are never to be used during the training process, only to evaluate it.

Even if the test metric is the same as the training metric, generally the two do not evolve in the same way during training. Recall that test examples are not supposed to be used in any way during training; however, the test error may be monitored and plotted as a function of training epoch t, and typically its evolution compared to the training error is as shown in Fig. 5. The reason test error begins to increase after a certain training duration is that overtraining results in overfitting: network function becomes so specific to the training examples that it damages generalization. It is tempting to use the test error evolution to determine the optimum training duration topt; however, that is not permissible because it would contaminate the integrity of the test examples. This is the reason we set aside the third set of validation examples; their only purpose is to monitor error on them, and stop training just before this validation error starts to increase. Assuming that all three sets of training, test, and validation examples have been drawn so that they are statistically representative of the class of objects of interest, there is a reasonable guarantee that topt for validation and test error will be the same.

f:id:AI_ML_DL:20210827115825p:plain
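The train/validation/test procedure described above can be sketched in a few lines. This is only my own illustration: the split fractions, the `patience` parameter, the function names, and the validation error values are all made up, and a real training loop would compute the validation error after each epoch instead of reading it from a list.

```python
import numpy as np

def three_way_split(n, frac_train=0.7, frac_val=0.15, seed=0):
    """Shuffle n example indices and split them into train/validation/test."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(frac_train * n)
    n_val = int(frac_val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def early_stopping_epoch(val_errors, patience=1):
    """Return the epoch with the lowest validation error seen before the
    error fails to improve for `patience` consecutive epochs."""
    best_epoch, best_err, waited = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

train_idx, val_idx, test_idx = three_way_split(100)
# Made-up validation errors: they fall, then rise as overfitting sets in.
val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
t_opt = early_stopping_epoch(val_errors, patience=2)
print(t_opt)
```

The test indices are never touched when choosing `t_opt`, which is the point of setting aside the separate validation subset.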

C. Weight Regularization
Overtraining and overfitting relate to the complexity of the model being learned vis-à-vis the complexity of the NN. Here, we use the term complexity in the context of degrees of freedom in a computational model [133,134]. For learning models, in particular, model complexity is known as Vapnik–Chervonenkis (VC) dimension [135–138], and it should match the complexity of the computational task. Unfortunately, the VC dimension itself is seldom directly computable except for the simplest ML architectures.

D. Convolutional Neural Networks
Certain tasks, such as speech and image processing, are naturally invariant to temporal and spatial shifts, respectively. This may be exploited to regularize the weights through convolutional architectures [146,147]. The convolutional NN (CNN) principle limits the spatial range on the next layer, i.e., the neighborhood where each unit is allowed to influence, and makes the weights spatially repeating.
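The two ingredients named here, a limited neighborhood and spatially repeating weights, can be seen in a tiny 1-D example. This is my own sketch, with a made-up signal and kernel and circular boundary handling chosen only to keep the shift test exact; like most CNN layers it computes a cross-correlation.

```python
import numpy as np

def conv_layer_1d(x, kernel):
    """Each output unit sees only len(kernel) neighbouring inputs, and every
    output unit reuses the same kernel weights (circular boundary)."""
    k = len(kernel)
    n = len(x)
    out = np.zeros(n)
    for i in range(n):
        for j in range(k):
            out[i] += kernel[j] * x[(i + j - k // 2) % n]
    return out

x = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0])
kernel = np.array([0.25, 0.5, 0.25])   # 3 shared weights instead of 8 x 8 dense weights

y = conv_layer_1d(x, kernel)
y_shifted = conv_layer_1d(np.roll(x, 2), kernel)
# Shifting the input by two samples shifts the output by the same amount.
print(np.allclose(y_shifted, np.roll(y, 2)))
```

A fully connected layer between the same eight inputs and eight outputs would need 64 independent weights, so the shared kernel is a strong built-in regularizer.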

E. Training Loss Functions
The most obvious training loss function (TLF) choices are the L2 (mean square error, MSE) and L1 (mean absolute error, MAE) metrics.
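For reference, the two metrics are easy to write down. The sketch below is my own, with made-up numbers chosen so that a single large error shows how differently the two losses react.

```python
import numpy as np

def mse_loss(pred, target):
    """L2-type loss: mean of squared differences."""
    return float(np.mean((pred - target) ** 2))

def mae_loss(pred, target):
    """L1-type loss: mean of absolute differences."""
    return float(np.mean(np.abs(pred - target)))

pred = np.array([1.0, 2.0, 3.0, 4.0])
target = np.array([1.0, 2.0, 3.0, 8.0])   # one pixel is off by 4

print(mse_loss(pred, target))   # 4.0: the outlier is squared before averaging
print(mae_loss(pred, target))   # 1.0: the outlier counts linearly
```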

F. Physics Priors
Unlike abstract classification, e.g., face recognition and customer taste prediction, in CI, the input g and intended output fˆ ≈ f of the NN are related by the known physics of the imaging system, i.e., by the operator H. Physical knowledge should be useful; how then to best incorporate it into an ML engine for imaging?

One possibility is to not incorporate it at all, as depicted in Fig. 10(a).

f:id:AI_ML_DL:20210827222040p:plain

A compromise is the single-pass ML engine in Fig. 10(d). Here, an approximate inverse operator H* produces the single approximant f [0]. The single DNN is trained to receive f [0] as input and produce the image fˆ as output directly, rather than its projection onto the null space. In practice, the single-pass approach has proven to be robust and reliable even for CI problems with high ill posedness or uncertainty, as we will see in Sections 4.A(super-resolution)–4.C.

4. COMPUTATIONAL IMAGING REALIZATIONS WITH MACHINE LEARNING

The strategy for using ML for computational image formation is broadly described as follows:

(1) Obtain a database of physical realizations of objects and their corresponding raw intensity images through the instrument of interest. For example, such a physical database may be built by using an alternative imaging instrument considered accurate enough to be trusted as ground truth; or by displaying objects from a publicly available abstract database, e.g., ImageNet [178] on a spatial light modulator (SLM) as phase or intensity; or by rigorous simulation of the forward operator and associated noise processes.
(2) Decide on an ML engine, regularization strategy, TLF (training loss function), and physical priors according to the principles of Sections 3.C–3.F, and then train the NN from the training and validation subsets of the database, as described in Section 3.B.
(3) Test the ML engine for generalization by measuring a TLF, same as training or different, on the test example subset of the database.

A. Super-Resolution

The two-point resolution problem was first posed by Airy [179] and Lord Rayleigh [180]. In modern optical imaging systems, resolution is understood to be limited by mainly two factors: undersampling by the camera, whence super-resolution should be taken to mean upsampling; and blur by the optics or camera motion, in which case super-resolution means deblurring. Both situations or their combination lead to a singular or severely ill-posed inverse problem due to suppression or loss of entire spatial frequency bands; therefore, they have attracted significant research interest, including some of the earliest uses of ML in the CI context.

[179]. G. B. Airy, “On the diffraction of an object-glass with circular aperture,” Trans. Cambridge Philos. Soc. 5, 283–291 (1834).
[180]. L. Rayleigh, “Investigations in optics, with special reference to the spectroscope,” Philos. Mag. 8(49), 261–274 (1879).

A comprehensive review of methods for super-resolution in the sense of upsampling, based on a single image, is in [181]. To our knowledge, the first-ever effort to use a DNN in the same context was by Dong et al. [182,183]. The key insight, as with LISTA (Learned Iterative Shrinkage and Thresholding Algorithm) (Section 3.F), was that dictionary-based sparse representations for upsampling [92,93] could equivalently be learned by DNNs. Both approaches similarly start by extracting compressed feature maps and then expanding these maps to a higher sampling rate. The difference is that sparse coding solvers are iterative; whereas, as we also pointed out in Section 1, with the ML approach, the iterative scheme takes place during training only; the trained ML engine operation is feed-forward and, thus, very fast. To combine super-resolution with motion compensation, a spatio-temporal CNN has been proposed, where, rather than simple images, the inputs are blocks consisting of multiple frames from video [184].

[92]. J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Trans. Image Proc. 19, 2861–2873 (2010).

[93]. J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang, “Coupled dictionary training for image super-resolution,” IEEE Trans. Image Proc. 21, 3467–3478 (2012).

[181]. C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: a benchmark,” in European Conference on Computer Vision (ECCV)/ Lecture Notes on Computer Science, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, eds. (2014), Vol. 8692, pp. 372–386.

[182]. C. Dong, C. Loy, K. He, and X. Tang, “Learning a deep convolutional neural network for image super-resolution,” in European Conference on Computer Vision (ECCV)/Lecture Notes on Computer Science Part IV (2014), Vol. 8692, pp. 184–199.

[183]. C. Dong, C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intel. 38, 295–307 (2015).

[184]. J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi, “Real-time video super-resolution with spatio-temporal networks and motion compensation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 4778–4787.

The ML approach to the super-resolution problem also served as motivation and testing ground for the perceptual TLF [170,171] (Section 3.E). The structure of the downsampling kernel was exploited in [177] using the cascaded ML engine architecture in Fig. 10(c) with M = 4. Figure 11 is a representative result showing the evolution of the image estimates along the ML cascade, as well as their spatial spectra. It is interesting that, by the final stage, the ML engine has succeeded in both suppressing high-frequency artifacts due to undersampling and boosting low frequency components to make the reconstruction appear smooth.

[170]. J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision (ECCV)/Lecture Notes on Computer Science, B. Leibe, J. Matas, N. Sebe, and M. Welling, eds. (2016), Vol. 9906, pp. 694–711.
[171]. C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 4681–4690.

[177]. M. Mardani, H. Monajemi, V. Papyan, S. Vasanawala, D. Donoho, and J. Pauly, “Recurrent generative residual networks for proximal learning and automated compressive image recovery,” arXiv:1711.10046 (2017).

Turning to inverse problems dominated by blur, early work [185] used a perceptron network with two hidden layers and a sigmoidal activation function to compensate for static blur caused by Gaussian and rectangular kernels, as well as motion blur [186].
Two years later, Sun Jiao et al. [187] showed that a CNN can learn to compensate even when the motion blur kernel across the image is non-uniform. This was accomplished by feeding the CNN with rotated patches containing simple object features, such that the network learned to predict the direction of motion.

[185]. C. J. Schuler, H. Christopher Burger, S. Harmeling, and B. Scholkopf, “A machine learning approach for non-blind image deconvolution,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013).
[186]. A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, “Understanding and evaluating blind deconvolution algorithms,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2009).
[187]. J. Sun, W. Cao, Z. Xu, and J. Ponce, “Learning a convolutional neural network for non-uniform motion blur removal,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).

In optical microscopy, blur is typically caused by aberrations and diffraction [188]. More than 100 years of research, tracing back to Airy and Rayleigh’s observations, have been oriented toward modifying the optical hardware—in our language, designing the illumination and collection operators—to compensate for the blur and obtain sharp images of objects down to sub-micrometer size. Thorough review of this literature is beyond the present scope; we just point out the culmination of optical super-resolution
methods with the 2014 Nobel Prize in Chemistry [189–192]. Stochastic optical reconstruction microscopy (STORM) and fluorescence photoactivation localization microscopy (PALM) for single molecule imaging [193,194] and localization [195] are
examples of co-designing the illumination operator Hi and the computational inverse to achieve performance vastly better than an unaided microscope could do.

[188]. M. Sarikaya, “Evolution of resolution in microscopy,” Ultramicroscopy 47, 1–14 (1992).
[189]. W. E. Moerner and L. Kador, “Optical detection and spectroscopy of single molecules in a solid,” Phys. Rev. Lett. 62, 2535–2538 (1989).
[190]. S. W. Hell and J. Wichmann, “Breaking the diffraction resolution limit by stimulated emission: stimulated-emission-depletion fluorescence microscopy,” Opt. Lett. 19, 780–782 (1994).
[191]. E. Betzig, “Proposed method for molecular optical imaging,” Opt. Lett. 20, 237–239 (1995).
[192]. R. M. Dickson, A. B. Cubitt, R. Y. Tsien, and W. E. Moerner, “On/off blinking and switching behaviour of single molecules of green fluorescent protein,” Nature 388, 355–358 (1997).
[193]. M. J. Rust, M. Bates, and X. Zhuang, “Sub-diffraction-limit imaging by stochastic optical reconstruction microscopy (STORM),” Nat. Methods 3, 793–796 (2006).
[194]. E. Betzig, G. H. Patterson, R. Sougrat, O. W. Lindwasser, S. Olenych, J. S. Bonifacino, M. W. Davidson, J. Lippincott-Schwarz, and H. F. Hess, “Imaging intracellular fluorescent proteins at nanometer resolution,” Science 313, 1642–1645 (2006).
[195]. S. T. Hess, T. P. Girirajan, and M. D. Mason, “Ultra-high resolution imaging by fluorescence photoactivation localization microscopy,” Biophys. J. 91, 4258–4272 (2006).

Computationally, the blur kernel can be compensated for through iterative blind deconvolution [196,197] or learned from examples [198]. A DNN-based solution to the inverse problem was proposed for the first time, to our knowledge, by Rivenson et al. [199] in a wide-field microscope. The approach and results are summarized in Fig. 12. For training, the samples were imaged twice, once with a 40 × 0.95 NA objective lens and again with a 100 × 1.4 NA objective lens. The training goal was such that with the 40 × 0.95 NA raw images as input g, the DNN would produce estimates fˆ matching the 100 × 1.4 NA images, i.e., the latter were taken to approximate the true objects f. The number of pixels in the high-resolution images was (2.5)^2 × the number of pixels in the low-resolution representation. Of course, the low-resolution images were also subject to stronger blur due to the lower-NA objective lens. Therefore, the inverse algorithm had to perform both upsampling and deblurring in this case. The ML engine was of the end-to-end type, as in Fig. 10(a), implemented as a convolutional DNN with pyramidal progression for upsampling. The TLF was a mixture of the MSE metric (23) and a TV-like ∂2 TV [Eq. (6)] penalty. Since then, ML has been shown to improve the resolution of fluorescence microscopy [200], as well as single-molecule STORM imaging [201] and 3D localization [202].

[196]. T. G. Stockham, T. M. Cannon, and R. B. Ingebretsen, “Blind deconvolution through digital signal processing,” Proc. IEEE 63, 678–692 (1975).
[197]. G. R. Ayers and J. C. Dainty, “Iterative blind deconvolution method and its applications,” Opt. Lett. 13, 547–549 (1988).
[198]. T. Kenig, Z. Kam, and A. Feuer, “Blind image deconvolution using machine learning for three-dimensional microscopy,” IEEE Trans. Pattern Anal. Mach. Intel. 32, 2191–2204 (2010).
[199]. Y. Rivenson, Z. Gorocs, H. Gunaydin, Y. Zhang, H. Wang, and A. Ozcan, “Deep learning microscopy,” Optica 4, 1437–1443 (2017).

f:id:AI_ML_DL:20210828141235p:plain

[200]. H. Wang, Y. Rivenson, Z. Wei, H. Gunaydin, L. Bentolila, and A. Ozcan, “Deep learning achieves super-resolution in fluorescence microscopy,” Nat. Methods (2018).

f:id:AI_ML_DL:20210828140953p:plain

[201]. E. Nehme, L. E. Weiss, T. Michaeli, and Y. Shechtman, “Deep-STORM: super-resolution single-molecule microscopy by deep learning,” Optica 5, 458–464 (2018).

f:id:AI_ML_DL:20210828135717p:plain

[202]. N. Boyd, E. Jonas, H. P. Babcock, and B. Recht, “DeepLoco: fast 3D localization microscopy using neural networks,” bioRxiv.

f:id:AI_ML_DL:20210828134949p:plain

My main goal was to learn about super-resolution. I think I now understand what it is, and from here I can follow the related literature as my purpose requires. For now, let's move on.

B. Quantitative Phase Retrieval and Lensless Imaging

The forward operator relating the complex amplitude of an object to the raw intensity image at the exit plane of an optical system is nonlinear. Classical iterative solutions are the Gerchberg–Saxton algorithm [203,204]; the input–output algorithm, originally proposed by Fienup [205] and subsequent variants [206–208]; and the gradient descent [209] or its variants, steepest descent and conjugate gradient [210]. This inverse problem has attracted considerable attention because of its importance in retrieving the shape or optical density of transparent samples with visible light [211,212] and x rays [213,214].
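To get a feel for what an iterative phase retrieval algorithm does, here is a minimal Gerchberg–Saxton style sketch. It assumes the textbook two-plane setting in which the amplitudes are known in an object plane and in a plane related to it by a 2-D Fourier transform; the function names, the random test field, and the iteration counts are my own assumptions, and this is not a reproduction of any cited implementation.

```python
import numpy as np

def gerchberg_saxton(amp_obj, amp_fourier, n_iter=200, seed=0):
    """Estimate a complex field from its amplitude in two Fourier-related planes."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, size=amp_obj.shape)   # random initial phase
    field = amp_obj * np.exp(1j * phase)
    for _ in range(n_iter):
        F = np.fft.fft2(field)
        F = amp_fourier * np.exp(1j * np.angle(F))        # impose measured Fourier amplitude
        field = np.fft.ifft2(F)
        field = amp_obj * np.exp(1j * np.angle(field))    # impose measured object amplitude
    return field

# Hypothetical ground-truth field, used only to generate the two amplitude "measurements".
rng = np.random.default_rng(1)
true_field = rng.uniform(0.5, 1.5, (16, 16)) * np.exp(1j * rng.uniform(-1, 1, (16, 16)))
amp_obj = np.abs(true_field)
amp_fourier = np.abs(np.fft.fft2(true_field))

def fourier_error(field):
    """Mismatch between the estimate's Fourier amplitude and the measured one."""
    return np.linalg.norm(np.abs(np.fft.fft2(field)) - amp_fourier)

err_1 = fourier_error(gerchberg_saxton(amp_obj, amp_fourier, n_iter=1))
err_50 = fourier_error(gerchberg_saxton(amp_obj, amp_fourier, n_iter=50))
print(err_1, err_50)
```

Each pass alternates between the two amplitude constraints, so the Fourier-plane mismatch cannot increase from one iteration to the next, although the iteration can still stagnate short of the true phase.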

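Phase retrieval and lensless imaging? I have no idea what these are. Looking over the titles of the references cited in this paragraph might tell me something.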

[203]. R. W. Gerchberg and W. O. Saxton, “Phase determination from image and diffraction plane pictures in electron-microscope,” Optik 34, 275–284 (1971).
[204]. R. W. Gerchberg and W. O. Saxton, “Practical algorithm for the determination of phase from image and diffraction plane pictures,” Optik 35, 237–246 (1972).
[205]. J. R. Fienup, “Reconstruction of an object from the modulus of its Fourier transform,” Opt. Lett. 3, 27–29 (1978).
[206]. J. Fienup and C. Wackerman, “Phase-retrieval stagnation problems and solutions,” J. Opt. Soc. Am. A 3, 1897–1907 (1986).
[207]. H. H. Bauschke, P. L. Combettes, and D. R. Luke, “Phase retrieval, error reduction algorithm, and fienup variants: a view from convex optimization,” J. Opt. Soc. Am. A 19, 1334–1345 (2002).
[208]. V. Elser, “Phase retrieval by iterated projections,” J. Opt. Soc. Am. A 20, 40–55 (2003).
[209]. J. R. Fienup, “Phase retrieval algorithms: a comparison,” Appl. Opt. 21, 2758–2769 (1982).
[210]. M. R. Hestenes and E. Stiefel, “Method of conjugate gradients for solving linear systems,” J. Res. Natl. Bur. Stand. 49, 409–436 (1952).
[211]. P. Marquet, B. Rappaz, P. J. Magistretti, E. Cuche, Y. Emery, T. Colomb, and C. Depeursinge, “Digital holographic microscopy: a noninvasive contrast imaging technique allowing quantitative visualization of living cells with subwavelength axial accuracy,” Opt. Lett. 30, 468–470 (2005).
[212]. G. Popescu, T. Ikeda, R. R. Dasari, and M. S. Feld, “Diffraction phase microscopy for quantifying cell structure and dynamics,” Opt. Lett. 31, 775–777 (2006).
[213]. S. C. Mayo, T. J. Davis, T. E. Gureyev, P. R. Miller, D. Paganin, A. Pogany, A. W. Stevenson, and S. W. Wilkins, “X-ray phase-contrast microscopy and microtomography,” Opt. Express 11, 2289–2302 (2003).
[214]. F. Pfeiffer, T. Weitkamp, O. Bunk, and C. David, “Phase retrieval and differential phase-contrast imaging with low-brilliance x-ray sources,” Nat. Phys. 2, 258–261 (2006).

I don't really understand it, but let's move on.

In the case of weak scattering, the problem may be linearized through a quasi-hydrodynamic approximation leading to the transport of intensity equation (TIE) formulation [215,216]. Alternatively, if a reference beam is provided in the optical system, the measurement may be interpreted as a digital hologram [217], and the object may be reconstructed by a computational backpropagation algorithm [218,219] (not to be confused with the back-propagation algorithm for NN training, Section 3.B.) Ptychography captures measurements effectively in the phase (Wigner) space, where the problem is linearized, by modulating the illumination with a quadratic phase and structuring it so that it is confined and scanned in either space [220–224] or angle [225–227]. Due to the difficulty of the phase retrieval inverse problem, compressive priors have often been used to regularize it in its various linear forms, including digital holography [228,229], TIE [82,230], and Wigner deconvolution ptychography [231,232].

So many unfamiliar terms come up that I can't really follow this.

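Ptychography is a computational method of microscopic imaging. It generates images by processing many coherent interference patterns that have been scattered from an object of interest. Its defining characteristic is translational invariance, which means that the interference patterns are generated by one constant function moving laterally by a known amount with respect to another constant function. Source: Wikipedia (English). (In Japanese I have seen it written as タイコグラフィー.)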

Let me list the references cited in this paragraph as well. It is probably better than doing nothing.

[215]. M. R. Teague, “Deterministic phase retrieval: a Green’s function solution,” J. Opt. Soc. Am. 73, 1434–1441 (1983).
[216]. N. Streibl, “Phase imaging by the transport-equation of intensity,” Opt. Commun. 49, 6–10 (1984).
[217]. J. W. Goodman and R. Lawrence, “Digital image formation from electronically detected holograms,” Appl. Phys. Lett. 11, 77–79 (1967).
[218]. W. Xu, M. H. Jericho, I. A. Meinertzhagen, and H. J. Kreuzer, “Digital inline holography for biological applications,” Proc. Nat. Acad. Sci. USA 98, 11301–11305 (2001).
[219]. J. H. Milgram and W. Li, “Computational reconstruction of images from holograms,” Appl. Opt. 41, 853–864 (2002).
[220]. S. L. Friedman and J. M. Rodenburg, “Optical demonstration of a new principle of far-field microscopy,” J. Phys. D 25, 147–154 (1992).
[221]. B. C. McCallum and J. M. Rodenburg, “Two-dimensional demonstration of Wigner phase-retrieval microscopy in the STEM configuration,” Ultramicroscopy 45, 371–380 (1992).
[222]. J. M. Rodenburg and R. H. T. Bates, “The theory of super-resolution electron microscopy via Wigner-distribution deconvolution,” Philos. Trans. R. Soc. London A 339, 521–553 (1992).
[223]. A. M. Maiden and J. M. Rodenburg, “An improved ptychographical phase retrieval algorithm for diffractive imaging,” Ultramicroscopy 109, 1256–1262 (2009).
[224]. P. Li, T. B. Edo, and J. M. Rodenburg, “Ptychographic inversion via Wigner distribution deconvolution: noise suppression and probe design,” Ultramicroscopy 147, 106–113 (2014).
[225]. G. Zheng, R. Horstmeyer, and C. Yang, “Wide-field, high-resolution Fourier ptychographic microscopy,” Nat. Photonics 7, 739–745 (2013).
[226]. X. Ou, R. Horstmeyer, and C. Yang, “Quantitative phase imaging via Fourier ptychographic microscopy,” Opt. Lett. 38, 4845–4848 (2013).
[227]. R. Horstmeyer, “A phase space model for Fourier ptychographic microscopy,” Opt. Express 22, 338–358 (2014).
[228]. D. J. Brady, K. Choi, D. L. Marks, R. Horisaki, and S. Lim, “Compressive holography,” Opt. Express 17, 13040–13049 (2009).
[229]. Y. Rivenson, A. Stern, and B. Javidi, “Compressive Fresnel holography,” J. Disp. Technol. 6, 506–509 (2010).
[230]. A. Pan, L. Xu, J. C. Petruccelli, R. Gupta, B. Singh, and G. Barbastathis, “Contrast enhancement in x-ray phase contrast tomography,” Opt. Express 22, 18020–18026 (2014).
[231]. Y. Zhang, W. Jiang, L. Tian, L. Waller, and Q. Dai, “Self-learning based Fourier ptychographic microscopy,” Opt. Express 23, 18471–18486 (2015).
[232]. J. Lee and G. Barbastathis, “Denoised Wigner distribution deconvolution via low-rank matrix completion,” Opt. Express 24, 20069–20079 (2016).

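I looked through one or two of the cited references, but they are not easy to understand. I will move on, hoping that once deep learning enters the picture I will be able to use these image processing methods even without understanding the equations.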

When the linearization assumptions do not apply or regularization priors are not explicitly available, an ML engine may instead be applied directly on the nonlinear inverse problem. To our knowledge, this investigation was first attempted by Sinha et al. with binary pure phase objects [233], and subsequently with multi-level pure phase objects [234]. Representative results are shown in Fig. 13. The phase objects were displayed on a reflective SLM (spatial light modulator), and the light propagated in free space until intensity sampling by the camera. The ML engine of the end-to-end type [Fig. 10(a)] was of the convolutional DNN type with residuals. Training was carried out by drawing objects from standard databases, Faces-LFW, and ImageNet, converting each object’s grayscale intensity to a phase signal in the range (0, π), and then displaying that signal on the SLM. Because of the relatively large range of phase modulation, linearizing assumptions would have been invalid in this arrangement.
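To picture the forward process that the DNN has to invert in this kind of experiment, here is a rough simulation sketch: a grayscale image is mapped linearly to a phase between 0 and π, and the resulting pure phase object is propagated through free space with the angular spectrum method before taking the intensity. The angular spectrum propagator is my own choice of model, and the wavelength, pixel pitch, propagation distance, test image, and function names are all hypothetical values picked for illustration, not the parameters of the cited experiments.

```python
import numpy as np

def grayscale_to_phase(img):
    """Map 8-bit grayscale values in [0, 255] linearly to phase in [0, pi]."""
    return np.pi * img.astype(float) / 255.0

def propagate_intensity(phase, wavelength, pitch, z):
    """Intensity of a unit-amplitude phase object after free-space propagation
    over distance z, computed with the angular spectrum method."""
    n_y, n_x = phase.shape
    field = np.exp(1j * phase)                       # pure phase object
    fx = np.fft.fftfreq(n_x, d=pitch)                # spatial frequencies (1/m)
    fy = np.fft.fftfreq(n_y, d=pitch)
    FX, FY = np.meshgrid(fx, fy)
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    kz = 2.0 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    transfer = np.where(arg > 0, np.exp(1j * kz * z), 0.0)  # drop evanescent waves
    out = np.fft.ifft2(np.fft.fft2(field) * transfer)
    return np.abs(out) ** 2                          # the camera records intensity only

img = (np.arange(64, dtype=np.uint8) * 4).reshape(8, 8)   # made-up 8 x 8 test image
phase = grayscale_to_phase(img)
intensity = propagate_intensity(phase, wavelength=0.633e-6, pitch=8e-6, z=0.05)
print(intensity.shape, intensity.sum())
```

At z = 0 the intensity is uniformly 1, so all information about the phase is invisible to the camera; it is the propagation that turns phase variations into a measurable diffraction pattern.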

[233]. A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” arXiv:1702.08516 (2017).
[234]. A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4, 1117–1125 (2017).
Abstract : Deep learning has been proven to yield reliably generalizable solutions to numerous classification and decision tasks. Here, we demonstrate for the first time to our knowledge that deep neural networks (DNNs) can be trained to solve end-to-end inverse problems in computational imaging. We experimentally built and tested a lensless imaging system where a DNN was trained to recover phase objects given their propagated intensity diffraction patterns.

Retrieval of the complex amplitude, i.e., of both the magnitude and phase, of biological samples using ML in the digital holography (DH) arrangement was reported by Rivenson et al. [240]; see Fig. 14. The samples used in the experiments were from breast tissue, Papanicolaou (Pap) smears, and blood smears. In this case, the ML engine used a single-pass physics-informed preprocessor, as in Fig. 10(d), with the approximant H implemented as the (optical) backpropagation algorithm. The DNN was of the convolutional type. Training was carried out using up to eight holograms to produce accurate estimates of the samples’ phase profiles. After training, the ML engine was able, with a single hologram input, to match imaging quality, in terms of SSIM (Structural Similarity Image Measure : Section 3.E) of traditional algorithms that would have required two to three times as many holograms, and was faster as well by a factor of three to four times.

[240]. Y. Rivenson, Y. Zhang, H. Gunaydin, D. Teng, and A. Ozcan, “Phase recovery and holographic image reconstruction using deep learning in neural networks,” Light Sci. Appl. 7, 17141 (2018).

Abstract : Phase recovery from intensity-only measurements forms the heart of coherent imaging techniques and holography. In this study, we demonstrate that a neural network can learn to perform phase recovery and holographic image reconstruction after appropriate training. This deep learning-based approach provides an entirely new framework to conduct holographic imaging by rapidly eliminating twin-image and self-interference-related spatial artifacts. This neural network-based method is fast to compute and reconstructs phase and amplitude images of the objects using only one hologram, requiring fewer measurements in addition to being computationally faster. We validated this method by reconstructing the phase and amplitude images of various samples, including blood and Pap smears and tissue sections. These results highlight that challenging problems in imaging science can be overcome through machine learning, providing new avenues to design powerful computational imaging systems.

f:id:AI_ML_DL:20210828212708p:plain

[256]. M. Deng, S. Li, and G. Barbastathis, “Learning to synthesize: splitting and recombining low and high spatial frequencies for image recovery,” arXiv:1811.07945 (2018).

f:id:AI_ML_DL:20210829070854p:plain

C. Imaging of Dark Scenes

The challenges associated with super-resolution and phase retrieval become much exacerbated when the photon budget is tight or other sources of noise are strong. This is because deconvolutions, in general, tend to amplify noise artifacts [5]. In standard photography, histogram equalization and gamma correction are automatically applied by modern high-end digital cameras and even in smartphones; however, “grainy” images and color distortion still occur. In more challenging situations, a variety of more sophisticated denoising algorithms utilizing compressed sensing and local feature representations have been investigated and benchmarked [257–262]. What these algorithms exploit, with varying success, is that natural images are characterized by the prior of strong correlation structure, which should persist even under noise fluctuations that much exceed the signal. Understood in this sense, ML presents itself as an attractive option to learn the correlation structures and then recover high-resolution content from the noisy raw images.

The first use of a CNN for monochrome Poisson denoising, to our knowledge, was by Remez et al. [263]. More recently, a convolutional network of the U-net type was trained to operate on all three color channels under illumination and exposure conditions that, to the naked eye, make the raw images appear entirely dark while histogram- and gamma-corrected reconstructions are severely color distorted [169]; see Fig. 17. The authors created a see-in-the-dark (SID) dataset of short-exposure images, coupled with their respective long-exposure images, for training and testing; and used Amazon’s Mechanical Turk platform for perceptual image evaluation by humans [168]. They also report that, unlike other related works, neither skip connections in U-net nor generative adversarial training led to any improvement in their reconstructions.

[169]. C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018), pp. 3291–3300.

f:id:AI_ML_DL:20210829090709p:plain

f:id:AI_ML_DL:20210829090806p:plain

Lyu et al. [279] used deep learning with the single-pass physics-informed engine [Fig. 10(d)] and approximant H computed according to the original computational ghost imaging [273]. Due to the low sampling rate and the noisy nature of the raw measurements, the approximant reconstructions $\hat{f}^{[0]}$ were corrupted and unrecognizable. However, when these $\hat{f}^{[0]}$ were used as input to the DNN, high-quality final estimates $\hat{f}$ were obtained even with sampling rates β as low as 5%, as shown in Fig. 19.
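
As a rough illustration of why the approximant degrades at low sampling rates, the sketch below implements the textbook correlation estimate for computational ghost imaging, ⟨B·I⟩ − ⟨B⟩⟨I⟩, where B is the single-pixel (bucket) signal and I the known illumination pattern. The random patterns, the 16×16 square object, and the function name `ghost_image` are assumptions for illustration only; this is not the pipeline of [279], and no DNN is involved.

```python
import numpy as np

def ghost_image(obj, n_patterns, rng):
    """Correlation-based ghost image from n_patterns random illumination patterns."""
    patterns = rng.random((n_patterns,) + obj.shape)   # known illumination patterns
    bucket = (patterns * obj).sum(axis=(1, 2))         # one single-pixel value per pattern
    # <B I> - <B><I>: covariance between the bucket value and each pattern pixel
    return ((bucket - bucket.mean())[:, None, None] * patterns).mean(axis=0)

rng = np.random.default_rng(0)
obj = np.zeros((16, 16))
obj[4:12, 4:12] = 1.0                                  # hypothetical test object

beta = 0.05                                            # sampling rate: patterns / pixels
few = ghost_image(obj, int(beta * obj.size), rng)      # 12 patterns for 256 pixels
many = ghost_image(obj, 16 * obj.size, rng)            # heavily oversampled reference

corr_few = np.corrcoef(few.ravel(), obj.ravel())[0, 1]
corr_many = np.corrcoef(many.ravel(), obj.ravel())[0, 1]
```

Comparing `corr_few` with `corr_many` shows how little of the object survives in the raw correlation estimate at β = 5%; in the setup described above, that corrupted estimate is what the DNN receives as input.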

[273]. J. H. Shapiro, “Computational ghost imaging,” Phys. Rev. A 78, 061802 (2008).

[279]. M. Lyu, W. Wang, H. Wang, H. Wang, G. Li, N. Chen, and G. Situ, “Deep-learning-based ghost imaging,” Sci. Rep. 7, 17865 (2017).

f:id:AI_ML_DL:20210829092221p:plain

D. Imaging in the Presence of Strong Scattering

Imaging through diffuse media [280,281] is a classical challenging inverse problem with significant practical applications ranging from non-invasive medical imaging through tissue to autonomous navigation of vehicles in foggy conditions. The noisy statistical inverse model formulation (2) must now be reinterpreted with the forward operator H itself becoming random. When f is the index of refraction of the strongly scattering medium itself, then H is also nonlinear. Not surprisingly, this topic has attracted considerable attention in the literature, with most attempts generally belonging to one of two categories. The first is to characterize the diffuse medium H, assuming it is accessible and static, through (incomplete) measurement of the operator H, which in this context is referred to as transmission matrix [282–284]. The alternative is to characterize statistical similarities between moments of H. The second-order moment, or speckle correlations, are known as the memory effect. The idea originated in the context of electron propagation in disordered conductors [285] and of course is also valid for the analogous problem of optical disordered media [286–290].

Deep learning solutions to the problem were first presented in [299] and [155], using end-to-end fully connected and residual convolutional (CNN) architectures, respectively. Results are shown in Figs. 22 and 23. The fully connected solution [299] is motivated by the physical fact that when light propagates through a strongly scattering medium, every object pixel influences every raw image pixel in shift non-invariant fashion. However, the large number of connections creates risks of undertraining and overfitting, and limits the space-bandwidth product (SBP) of the reconstructions due to limited computational resources. On the other hand, the CNN trained with NPCC loss function [155,300], despite being designed for situations when limited range of influence and shift invariance constraints are valid, Section 3.D, does a surprisingly good job at learning shift variance (presumably through the ReLU nonlinearities and pooling operations) and achieves larger SBP. Both methods work well with spatially sparse datasets, e.g., handwritten numerical digits, and Latin and Chinese characters. Compared to Horisaki et al. [18], the deep architectures perform comparably well with spatially dense datasets of restricted content, e.g., faces, and also hallucinate when tested outside their learned priors.
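
The NPCC loss mentioned above is the negative Pearson correlation coefficient between the reconstruction and the ground truth. A minimal NumPy sketch follows; the function name `npcc` is mine, and a real training loop would use the deep-learning framework's differentiable tensors rather than NumPy arrays.

```python
import numpy as np

def npcc(pred, target):
    """Negative Pearson correlation coefficient: -1 for a perfect match, +1 for an inverted one."""
    p = np.asarray(pred, dtype=float).ravel()
    t = np.asarray(target, dtype=float).ravel()
    p = p - p.mean()
    t = t - t.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    return -float((p * t).sum() / denom)

rng = np.random.default_rng(0)
img = rng.random((32, 32))
loss_same = npcc(img, img)                 # identical images
loss_scaled = npcc(3.0 * img + 2.0, img)   # same image with different gain and offset
loss_inverted = npcc(1.0 - img, img)       # contrast-inverted image
```

Because the correlation ignores gain and offset, the loss rewards getting the spatial structure right rather than the absolute intensity scale.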

Non-line-of-sight (NLOS) imaging, recognition, and tracking belong to a related class of problems, because capturing details about objects in such cases must rely on scattering, typically of light pulses [301–309] or spatially incoherent light [310–313]. Convolutional DNNs have been found to be useful for improving gesture classification [314], and person identification and three-dimensional localization [315]; in the latter case even with a single-photon, single-pixel detector only.

5. CONCLUDING REMARKS

The diverse collection of ML flavors adopted and problems tackled by the CI community in a relatively brief time period, mostly since ∼2010 [104], indicate that the basic idea of doing at least partially the job of Tikhonov–Wiener optimization by DNN holds much promise. A significant increase in the rate of related publications is evident—we had trouble keeping up while crafting the present review—and is likely to accelerate, at least in the near future. As we saw in Section 4, in many cases, ML algorithms have been discovered to offer new insights or substantial performance improvements on previous CI approaches, mostly compressive sensing based, whereas in other cases, particular challenges associated with acute CI problems have prompted innovations in ML architectures themselves. This productive interplay is likely to benefit both disciplines in the long run, especially because of the strong connection they share through optimization theory and practice.

(Second half omitted)

AtomAI: A Deep Learning Framework for Analysis of Image and Spectroscopy Data in (Scanning) Transmission Electron Microscopy and Beyond

Maxim Ziatdinov, Ayana Ghosh, Tommy Wong and Sergei V. Kalinin

AtomAI is an open-source software package bridging instrument-specific Python libraries, deep learning, and simulation tools into a single ecosystem. AtomAI allows direct applications of the deep convolutional neural networks for atomic and mesoscopic image segmentation converting image and spectroscopy data into class-based local descriptors for downstream tasks such as statistical and graph analysis. For atomically-resolved imaging data, the output is types and positions of atomic species, with an option for subsequent refinement. AtomAI further allows the implementation of a broad range of image and spectrum analysis functions, including invariant variational autoencoders (VAEs). The latter consists of VAEs with rotational and (optionally) translational invariance for unsupervised and class-conditioned disentanglement of categorical and continuous data representations. In addition, AtomAI provides utilities for mapping structure property relationships via im2spec and spec2im type of encoder-decoder models. Finally, AtomAI allows seamless connection to the first principles modeling with a Python interface, including molecular dynamics and density functional theory calculations on the inferred atomic position. While the majority of applications to date were based on atomically resolved electron microscopy, the flexibility of AtomAI allows straightforward extension towards the analysis of mesoscopic imaging data once the labels and feature identification workflows are established/available. The source code and example notebooks are available at https://github.com/pycroscopy/atomai.

Jones R R, Hooper D C, Zhang L, Wolverson D and Valev V K 2019 "Raman techniques: Fundamentals and frontiers", Nanoscale Res. Lett. 14 1–34

Raman spectroscopy is now an eminent technique for the characterisation of 2D materials (e.g. graphene [8–10] and transition metal dichalcogenides [11–13]) and phonon modes in crystals [14–16]. Properties such as number of monolayers [9, 12, 17, 18], inter-layer breathing and shear modes [19], in-plane anisotropy [20], doping [21–23], disorder [10, 24–26], thermal conductivity [11], strain [27] and phonon modes [14, 16, 28] can be extracted using Raman spectroscopy.

Denoising of stimulated Raman scattering microscopy images via deep learning B. MANIFOLD, E. THOMAS, A. T. FRANCIS, A. H. HILL, AND DAN FU Vol. 10, No. 8 | 1 Aug 2019 | BIOMEDICAL OPTICS EXPRESS 3861

Improving Raman image quality: the aim is to bring data measured at 1 mW to the noise level of data measured at 20 mW.

Abstract: Stimulated Raman scattering (SRS) microscopy is a label-free quantitative chemical imaging technique that has demonstrated great utility in biomedical imaging applications ranging from real-time stain-free histopathology to live animal imaging. However, similar to many other nonlinear optical imaging techniques, SRS images often suffer from low signal to noise ratio (SNR) due to absorption and scattering of light in tissue as well as the limitation in applicable power to minimize photodamage. We present the use of a deep learning algorithm to significantly improve the SNR of SRS images. Our algorithm is based on a U-Net convolutional neural network (CNN) and significantly outperforms existing denoising algorithms. More importantly, we demonstrate that the trained denoising algorithm is applicable to images acquired at different zoom, imaging power, imaging depth, and imaging geometries that are not included in the training. Our results identify deep learning as a powerful denoising tool for biomedical imaging at large, with potential towards in vivo applications, where imaging parameters are often variable and ground-truth images are not available to create a fully supervised learning training set.

f:id:AI_ML_DL:20210829140448p:plain

Rapid histology of laryngeal squamous cell carcinoma with deep-learning based stimulated Raman scattering microscopy, Lili Zhang et al., Theranostics 2019, Vol. 9, Issue 9 2541

Abstract
Maximal resection of tumor while preserving the adjacent healthy tissue is particularly important for larynx surgery, hence precise and rapid intraoperative histology of laryngeal tissue is crucial for providing optimal surgical outcomes. We hypothesized that deep-learning based stimulated Raman scattering (SRS) microscopy could provide automated and accurate diagnosis of laryngeal squamous cell carcinoma on fresh, unprocessed surgical specimens without fixation, sectioning or staining. Methods: We first compared 80 pairs of adjacent frozen sections imaged with SRS and standard hematoxylin and eosin histology to evaluate their concordance. We then applied SRS imaging on fresh surgical tissues from 45 patients to reveal key diagnostic features, based on which we have constructed a deep learning based model to generate automated histologic results. 18,750 SRS fields of views were used to train and cross-validate our 34-layered residual convolutional neural network, which was used to classify 33 untrained fresh larynx surgical samples into normal and neoplasia. Furthermore, we simulated intraoperative evaluation of resection margins on totally removed larynxes.
Results: We demonstrated near-perfect diagnostic concordance (Cohen's kappa, κ > 0.90) between SRS and standard histology as evaluated by three pathologists. And deep-learning based SRS correctly classified 33 independent surgical specimens with 100% accuracy. We also demonstrated that our method could identify tissue neoplasia at the simulated resection margins that appear grossly normal with naked eyes.
Conclusion: Our results indicated that SRS histology integrated with deep learning algorithm provides potential for delivering rapid intraoperative diagnosis that could aid the surgical management of laryngeal cancer.

f:id:AI_ML_DL:20210829140827p:plain

This article ends here; next I will work on "Fuel Cells and Machine Learning, Part II", with a deadline of the end of September.

f:id:AI_ML_DL:20210717213932p:plain — style=169 iteration=500

f:id:AI_ML_DL:20210717214115p:plain — style=169 iteration=50

f:id:AI_ML_DL:20210717214247p:plain — style=169 iteration=5

2021-07-01

Kaggle Stroll (July 2021)

This month's agenda:

July 1 (Thu)

European Gravitational Observatory - EGO：17 teams, 3 months to go

A new competition has started.

G2Net Gravitational Wave Detection
Find gravitational wave signals from binary black hole collisions

f:id:AI_ML_DL:20210701101003p:plain — https://www.kaggle.com/c/g2net-gravitational-wave-detection/overview

Observation of Gravitational Waves from a Binary Black Hole Merger
B. P. Abbott et al. (LIGO Scientific Collaboration and Virgo Collaboration)

On September 14, 2015 at 09:50:45 UTC the two detectors of the Laser Interferometer Gravitational-Wave Observatory simultaneously observed a transient gravitational-wave signal. The signal sweeps upwards in frequency from 35 to 250 Hz with a peak gravitational-wave strain of $1.0 \times 10^{-21}$. It matches the waveform predicted by general relativity for the inspiral and merger of a pair of black holes and the ringdown of the resulting single black hole. The signal was observed with a matched-filter signal-to-noise ratio of 24 and a false alarm rate estimated to be less than 1 event per 203,000 years, equivalent to a significance greater than 5.1σ. The source lies at a luminosity distance of $410^{+160}_{-180}$ Mpc corresponding to a redshift $z = 0.09^{+0.03}_{-0.04}$. In the source frame, the initial black hole masses are $36^{+5}_{-4} M_\odot$ and $29^{+4}_{-4} M_\odot$, and the final black hole mass is $62^{+4}_{-4} M_\odot$, with $3.0^{+0.5}_{-0.5} M_\odot c^2$ radiated in gravitational waves. All uncertainties define 90% credible intervals. These observations demonstrate the existence of binary stellar-mass black hole systems. This is the first direct detection of gravitational waves and the first observation of a binary black hole merger.


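Virgo is part of a scientific collaboration of laboratories in six countries: Italy, France, the Netherlands, Poland, Hungary, and Spain. Other large interferometers similar to Virgo, including the two LIGO interferometers at the Hanford Site in Washington State and in Livingston, Louisiana, all share the same goal of detecting gravitational waves. Since 2007, Virgo and LIGO have agreed to share and jointly analyze the data recorded by their detectors and to publish the results jointly. Because interferometric detectors are not directional (they survey the whole sky) and are looking for weak, infrequent, one-off events, a gravitational wave must be detected simultaneously by multiple interferometers in order to confirm that a signal is valid and to estimate the direction of its source. (from Wikipedia)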

July 2 (Fri)

European Gravitational Observatory - EGO：45 teams, 3 months to go

Code that converts the data to images with PyCBC has been published. Of the 10 examples shown there, 6 had label 1, but I could identify only 3 of those by eye.

f:id:AI_ML_DL:20210702120240p:plain — From "Introduction to the Data Challenge" by Filip Morawski NCAC PAS

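Whether or not I use PyCBC, converting the data to images is probably essential.

Since this is not a code competition, the test data can also be image-processed freely, so it will likely turn into a race for high scores.

As an aside, I tried training a simple NN on the spectral data as raw numerical arrays. I tried various things, but got nowhere. Each spectrum has 4096 channels, which corresponds to a 64x64 image, so the information content should be sufficient, but I came away convinced that a 1-D array as-is is not enough to recognize the differences between spectra.

I was thinking of building a dataset by converting the data to spectrum images and PyCBC-processed images, but someone has already built and published one. They fed it to EfficientNetBn and are showing up near the top of the leaderboard. Impressive. The code is cleanly written.

July 3 (Sat)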

European Gravitational Observatory - EGO：81 teams, 3 months to go

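An ensemble-only notebook has been published. However, one of the ensemble inputs is the publisher's own prediction file. Unless I raise the level of my own predictions, which could serve as a base for an ensemble, I will quickly be left behind.

The competition has only just started, and I should not be neglecting the training code while playing with tuning the tuning code, but I can't help getting absorbed in the tuning game. In F1, phrases like "Overtake Paradise" and "Traffic Paradise" seem to be popular. Following that, maybe I'll call this "Tuning Paradise".

July 4 (Sun)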

European Gravitational Observatory - EGO：110 teams, 3 months to go

In this competition, you’ll aim to detect GW signals from the mergers of binary black holes. Specifically, you'll build a model to analyze simulated GW time-series data from a network of Earth-based detectors.

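In that it involves analyzing spectra created by simulation, this is similar to another competition that is open but currently suspended: "SETI Breakthrough Listen - E.T. Signal Search, Find extraterrestrial signals in data from deep space". The difference is that gravitational waves have been detected, whereas signals from extraterrestrial life have not yet been. Gravitational waves have a theoretical basis; signals from extraterrestrial life do not (or so I suppose).

The simulated gravitational-wave signals that appear in the detectors are outlined as follows.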

The parameters that determine the exact form of a binary black hole waveform are the masses, sky location, distance, black hole spins, binary orientation angle, gravitational wave polarisation, time of arrival, and phase at coalescence (merger). These parameters (15 in total) have been randomised according to astrophysically motivated prior distributions and used to generate the simulated signals present in the data, but are not provided as part of the competition data.

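In the GW150914 frequency-distribution data published in the 2016 paper, a characteristic contrast is visible, but the text says that such cases are only a small fraction. In other words, since the signals are not easy to pick out, the request seems to be for a model that can make the call more accurately and faster than existing methods.

Furthermore, gravitational-wave researchers are developing techniques to operate more precise instruments at more sites around the globe and link them for simultaneous detection, which improves source localization, and to raise detection accuracy so that the cause of each gravitational wave can be determined more precisely. Deep learning seems to be expected to be an indispensable element of that development.

July 5 (Mon)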

European Gravitational Observatory - EGO：127 teams, 3 months to go

Let's study the noise component and the gravitational-wave component.

Gravitational wave denoising of binary black hole mergers with deep learning

Wei Wei and E.A. Huerta, Physics Letters B 800 (2020) 135081
Gravitational wave detection requires an in-depth understanding of the physical properties of gravitational wave signals, and the noise from which they are extracted. Understanding the statistical properties of noise is a complex endeavor, particularly in realistic detection scenarios. In this article we demonstrate that deep learning can handle the non-Gaussian and non-stationary nature of gravitational wave data, and showcase its application to denoise the gravitational wave signals generated by the binary black hole mergers GW150914, GW170104, GW170608 and GW170814 from advanced LIGO noise. To exhibit the accuracy of this methodology, we compute the overlap between the time-series signals produced by our denoising algorithm, and the numerical relativity templates that are expected to describe these gravitational wave sources, finding overlaps O ≳ 0.99. We also show that our deep learning algorithm is capable of removing noise anomalies from numerical relativity signals that we inject in real advanced LIGO data. We discuss the implications of these results for the characterization of gravitational wave signals.


July 6 (Tue)

European Gravitational Observatory - EGO：143 teams, 3 months to go

Let's understand this paper:

Observation of Gravitational Waves from a Binary Black Hole Merger
B. P. Abbott et al. (LIGO Scientific Collaboration and Virgo Collaboration)

Ⅰ. INTRODUCTION

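The existence of gravitational waves was predicted by Einstein in 1916: transverse waves of spatial strain that travel at the speed of light.

In the same year, Schwarzschild published a solution of the field equations. Around 1958 (Finkelstein, Kruskal) it was understood to describe a black hole, and in 1963 (Kerr) it was generalized to rotating black holes.

Analytical studies of relativistic two-body dynamics advanced further, and in 2005 (Pretorius; Campanelli et al.; Baker et al.) calculations of the gravitational waves produced by the merger of binary black holes were reported. According to Pretorius's calculation, about 5% of the mass is lost in the merger and released as energy.

Experiments to detect gravitational waves began in the 1960s. In the 2000s, TAMA 300 in Japan, GEO 600 in Germany, LIGO in the United States, and Virgo in Italy were built, and from 2002 to 2011 they carried out joint observations.

It was LIGO that succeeded in detecting gravitational waves.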

Ⅱ. OBSERVATION

On September 14, 2015 at 09:50:45 UTC, the LIGO Hanford, WA, and Livingston, LA, observatories detected the coincident signal GW150914

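At that time, the Virgo detector was being upgraded. GEO 600, which may not have been sensitive enough in any case, was not in observing mode. TAMA 300 is not mentioned.

The situation in Japan: development of KAGRA, the successor to TAMA 300, began in 2010, and a progress report was published in 2013 (Phys. Rev. D 88, 043007), which states that operation was planned for 2017: "The construction of KAGRA started in 2010 and it is planned to start the operation of the detector at its full configuration in 2017." It was not ready in time for the great discovery of 2015.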

f:id:AI_ML_DL:20210706135628p:plain

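GW150914 captures the whole process, from two black holes forming a pair, through their merger into a single black hole, to the loss of mass. Just before the merger, the gravitational-wave frequency rises from 35 Hz to 150 Hz.

After seeing this waveform, the competition spectra (a label-1 spectrum borrowed from Yaroslav Isaienkov's public notebook is shown below) look a little odd.

The GW150914 spectrum spans 0.2 s, whereas the spectrum shown below spans 2 s.

The GW150914 data are posted in the competition overview, with a link to the paper in which they were published, so for an outsider these data become imprinted on the mind.

The spectrum below has label 1, so a GW is supposedly detected in it. But in GW150914 the frequency changes from 35 Hz to 150 Hz over 8 cycles, whereas the spectrum below shows no frequency change over about 24 cycles and stays at 12 Hz. I have no idea how to make sense of that.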

f:id:AI_ML_DL:20210706151401p:plain

Ⅲ. DETECTORS
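How are gravitational waves detected? The detectors are based on the Michelson interferometer, which measures changes in the lengths of two orthogonal arms. LIGO's arms are 4 km long and Virgo's are 3 km.

Because interferometric detectors are not directional (they survey the whole sky) and are looking for weak, infrequent, one-off events, a gravitational wave must be detected simultaneously by multiple interferometers in order to confirm that a signal is valid and to estimate the direction of its source. (by Wikipedia)
This may be relevant to the competition. H1 and L1 have the same characteristics, so if a GW is detected, I would expect the GW signal strength (strain amplitude) to be the same in both, with the phase offset by 6 ms. Virgo is less sensitive than LIGO, so its spectrum will look noisier, but the GW should still be detected with some phase offset. Converting these spectra to images and classifying them with a CNN seems a good approach, but what is the best way to have the model take in the three spectra?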
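
One way to check the expected arrival-time offset between two detectors is to look for the peak of their cross-correlation. The sketch below does this on a toy burst; the signal shape, the noise level, the 14-sample shift, and the names `estimate_delay`, `first`, and `second` are assumptions for illustration, not competition data or real detector output.

```python
import numpy as np

def estimate_delay(a, b, sample_rate):
    """Delay of b relative to a, in seconds, taken from the cross-correlation peak."""
    a = a - a.mean()
    b = b - b.mean()
    corr = np.correlate(b, a, mode="full")
    lag = int(np.argmax(corr)) - (len(a) - 1)   # positive lag: b arrives later than a
    return lag / sample_rate

fs = 2048                                        # Hz, assumed sampling rate
t = np.arange(2 * fs) / fs                       # 2 s of samples
burst = np.exp(-((t - 1.0) / 0.05) ** 2) * np.sin(2 * np.pi * 100 * t)  # toy burst, not a real GW

rng = np.random.default_rng(0)
first = burst + 0.02 * rng.standard_normal(t.size)
second = np.roll(burst, 14) + 0.02 * rng.standard_normal(t.size)  # 14 samples, about 6.8 ms later

delay = estimate_delay(first, second, fs)        # seconds
```

On real data the noise is far stronger than in this toy case, so the correlation would normally be computed after whitening and band-passing.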

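A schematic of the detector is shown below. LIGO has two arms, each formed by two mirrors that act as test masses. A gravitational wave passing through the detector changes the arm lengths. The difference between the length changes of the x and y arms is detected as a phase difference between the beams that are split into the x and y directions and return after reflection from the mirrors placed in each direction. Several enhancements are used to detect gravitational waves with sufficient sensitivity. First, an optical resonator amplifies the effect by a factor of 300. Second, a partially transmissive power-recycling mirror raises the 20 W laser beam to 100 kW. Third, a partially transmissive signal-recycling mirror at the output optimizes gravitational-wave signal extraction by broadening the bandwidth of the arm cavities.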

f:id:AI_ML_DL:20210706234158p:plain

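The amplitude spectral density shown at the upper right is expressed in terms of equivalent gravitational-wave strain amplitude. Many vertical lines resembling shot noise are visible; they include lines due to calibration, vibrational modes of the test-mass suspension fibers, and the AC power supply.

To calibrate the strain that a gravitational wave produces on the test masses, the photon pressure of a laser beam is used. This calibration laser is also used for tests in which simulated gravitational waveforms are reproduced.

To monitor environmental disturbances, many sensors are deployed, and their information is collected and aggregated with precise synchronization.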

Ⅳ. DETECTOR VALIDATION

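The signals observed during steady-state operation in the hours before and after GW150914 showed no difference, in either sensitivity or transient noise, from those observed during the rest of the run. I imagine an enormous amount of work went into demonstrating this.

They confirm that any signal that could arise from external causes is at most 6% of the strength of the GW150914 signal.

They also confirmed that there is no possibility of external disturbances producing spurious signals simultaneously at the two sites (Hanford and Livingston).

Ⅴ. SEARCHES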

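Two methods were used to search for gravitational-wave signals. One is filtering with the waveforms predicted by the theory of relativity; the other looks for generic transient signals, using waveforms constructed under minimal assumptions. The two independent searches both found a strong signal from a binary black hole merger. In addition, the results at the two sites agreed, with a time difference corresponding to the distance between them.

Estimating the background signal and the noise also appears to be very complex, difficult, and labor-intensive.

(My own back-of-the-envelope estimate: the GW150914 signal lasts about 0.2 s. Even a naive search through one hour of data means picking one candidate out of 18,000 spectrum images. Moreover, a 0.2 s spectrum that is cut off partway cannot be picked up correctly, so in practice the window has to be shifted in steps of about 0.02 s while looking for a spectrum that may or may not be there.)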
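
The window counts in the estimate above can be checked with a few lines. The helper name `window_count` is mine; the 3600 s duration, 0.2 s window, and 0.02 s step are the figures from the paragraph.

```python
def window_count(duration_s, window_s, step_s):
    """Number of full windows of length window_s obtained by sliding in steps of step_s."""
    # round() guards against floating-point error in the division
    return int(round((duration_s - window_s) / step_s)) + 1

non_overlapping = window_count(3600, 0.2, 0.2)   # back-to-back 0.2 s windows
overlapping = window_count(3600, 0.2, 0.02)      # window shifted by 0.02 s each time
```

Sliding in 0.02 s steps multiplies the number of candidate windows roughly tenfold, from 18,000 to about 180,000 per hour of data.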

July 8 (Thu)

Continuing with the paper:

A. Generic transient search

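Events are divided into three mutually exclusive classes based on their time-frequency morphology.

C1: noise transient: noise that became large temporarily

C2: all remaining events: everything other than C1 and C3

C3: events with frequency that increases with time: GW150914 falls into C3.

The time-frequency and time-series plots of GW170608 are shown. As with GW150914, the time-frequency image shows the contrast characteristic of a frequency change (a curved bright region).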

f:id:AI_ML_DL:20210708111428p:plain — Gravitational wave denoising of binary black hole mergers with deep learning, Wei Wei and E.A. Huerta

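In the competition too, spectra containing a gravitational wave of class C3 seem to be classified relatively easily using time-frequency images.

f:id:AI_ML_DL:20210708121329p:plain

I cannot find any description of how C2 is distinguished from C3, or C1 from (C1+C2), so let's check the cited reference.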

Observing gravitational-wave transient GW150914 with minimal assumptions

B. P. Abbott et al., arXiv:1602.03843v2 [gr-qc] 22 Aug 2016

The author list runs to about 1,000 names.

Ⅱ. DATA QUALITY AND BACKGROUND ESTIMATION

The "time-shift" method is effective for estimating the background.

Ⅲ. SEARCHES FOR GRAVITATIONAL WAVE BURSTS

coherent Waveburst (cWB)

omicron-LALInference-Bursts (oLIB)

BayesWave

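These three algorithms differ in their search strategies.

A. Coherent WaveBurst (cWB)

It has been used to analyze LIGO, Virgo, and GEO data since 2004. For GW150914, results were obtained 3 minutes after data acquisition. The low-latency search analyzes the 16-2048 Hz band, and the offline search analyzes the 16-1024 Hz band.

1. cWB pipeline overview

It targets a broad range of gravitational waveforms and does not assume any specific waveform. It identifies coincident events in the two LIGO detectors and reconstructs the gravitational waveform by maximum-likelihood analysis.

the data are whitened and converted to the time-frequency domain using the Wilson-Daubechies-Meyer wavelet transform.

Data from both detectors are then combined to obtain a time-frequency power map.

Whitening: pycbc provides whiten( ).

Time-frequency domain: for GW150914 it is 32-512 Hz, but for the competition it is unknown which gravitational-wave frequencies were added as simulated spectra.

Combining the detectors: judging from the public notebooks, they are put together into a single figure. Should they be overlaid or placed side by side? Let's compare both.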
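
The two ways of combining the detectors can be written down directly. In this sketch the three arrays are random stand-ins for per-detector time-frequency images, and their 64×128 shape is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for the time-frequency images of three detectors (frequency x time).
spectrograms = [rng.random((64, 128)) for _ in range(3)]

# Option 1: overlay. One channel per detector, like the RGB channels of a colour image.
stacked = np.stack(spectrograms, axis=-1)          # shape (64, 128, 3)

# Option 2: side by side. Detectors placed one above the other in a single-channel image.
tiled = np.concatenate(spectrograms, axis=0)       # shape (192, 128)
```

Stacking keeps the detectors aligned in time at every pixel, so a CNN can compare them within a single receptive field; tiling keeps each detector visually separate but makes the network relate distant image regions.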

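July 9 (Fri)

European Gravitational Observatory - EGO：205 teams, 3 months to go

I learned the outline of gravitational-wave detectors.

I learned the outline of how gravitational waves are picked out of the spectra measured by the detectors, 99% of which are noise.

I learned how the gravitational waves emitted during the merger of binary black holes are calculated.

From here, let's learn how the code on the competition site processes the signals and builds models that predict the presence or absence of gravitational waves.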

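July 10 (Sat)

European Gravitational Observatory - EGO：213 teams, 3 months to go

For turning the spectra into images, the Q-transform is the mainstream approach.

The simulation conditions for the gravitational waves have not been disclosed, so the frequencies, their distribution, their time evolution, and so on are unknown; I suppose that means setting the ranges on the wide side.

A note on the final submission: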

Your final score may not be based on the same exact subset of data as the public leaderboard, but rather a different private data subset of your full submission — your public score is only a rough indication of what your final score is.

You should thus choose submissions that will most likely be best overall, and not necessarily on the public subset.

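Only about 10 days have passed since the competition began. The current top score is 0.875; will the final score reach around 0.95?

So, what should I do from here?

One option is to become able to use pycbc and to understand gravitational waves.

According to Wikipedia,

PyCBC is an open-source software package, written primarily in the Python programming language, designed for use in gravitational-wave astronomy and gravitational-wave data analysis. PyCBC contains modules for signal processing, FFT, matched filtering, and gravitational waveform generation, among other tasks common in gravitational-wave data analysis.

The software is developed by the gravitational-wave community alongside LIGO and Virgo scientists to analyze gravitational-wave data, search for gravitational waves, and measure the properties of astrophysical sources. It has been used to analyze gravitational-wave data from the LIGO and Virgo observatories to detect gravitational waves from the mergers of neutron stars and black holes and to determine their statistical significance. PyCBC-based analyses can integrate with the Open Science Grid for large-scale computing resources. Software based on PyCBC has been used to rapidly analyze gravitational-wave data for astronomical follow-up.

There is a tutorial on GitHub, so let's use it.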

gwastro/PyCBC-Tutorials

How to access LIGO data
How to do some basic signal processing
Data visualization of LIGO data in time-frequency plots
Matched filtering to extract a known signal

July 11 (Sun)

European Gravitational Observatory - EGO：220 teams, 3 months to go

Several of the public notebooks use CQT1992v2, so I looked into CQT. A 2020 paper: nnAudio: An on-the-fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks:

I. INTRODUCTION

SPECTROGRAMS, as time-frequency representations of audio signals, have been used as input for neural network models since the 1980s [1–3]. Different types of spectrograms are tailored to different applications. For example, Mel spectrograms and Mel frequency cepstral coefficients (MFCCs) are designed for speech-related applications [4, 5], and the constant-Q transformation is best for music related applications [6, 7]. Despite recent advances in end-to-end learning in the audio domain, such as WaveNet [8] and SampleCNN [9], which make model training on raw audio data possible, many recent publications still use spectrograms as the input to their models for various applications [10]. (Note: CQT stands for constant-Q transformation.)

The target of the analysis there is audio signals. According to Wikipedia, an audio signal is defined as follows:

An audio signal is a representation of sound, typically using either a changing level of electrical voltage for analog signals, or a series of binary numbers for digital signals. Audio signals have frequencies in the audio frequency range of roughly 20 to 20,000 Hz, which corresponds to the lower and upper limits of human hearing. Audio signals may be synthesized directly, or may originate at a transducer such as a microphone, musical instrument pickup, phonograph cartridge, or tape head. Loudspeakers or headphones convert an electrical audio signal back into sound.

Since the task here is gravitational-wave analysis, let's look at the PyCBC tutorial "How to do some basic signal processing".

First, 32 seconds of measured data containing GW150914:

f:id:AI_ML_DL:20210711130701p:plain

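Next, 1 second of measured data containing GW150914: the signal is dominated by a low frequency of around 10 Hz, and noise at this frequency appears to be the largest noise component.

It looks very much like the spectra provided in the competition.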

f:id:AI_ML_DL:20210711131117p:plain

A high-pass filter is applied to remove the low-frequency noise.

high_data = data[ifo].highpass_fir(15, 512) # Highpass point is 15 Hz

f:id:AI_ML_DL:20210711131850p:plain

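Here the concept of power spectral density (PSD) is introduced. It is important to know how the noise power varies with frequency. Large noise power occurs at low frequencies and at certain specific frequencies. Examples include 60 Hz mains noise, the violin modes of the suspended mirrors, and resonances of various pieces of equipment.

f:id:AI_ML_DL:20210711140136p:plain

Next, the data are whitened. Whitening the data over a frequency range is an effective way to visualize deviations from the noise: excesses become visible as deviations from zero. Whitening flattens the power spectral density, so that all frequencies contribute equally. Applying a band-pass filter after whitening makes a specific frequency range visible.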
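
To see what whitening does numerically, here is a NumPy-only sketch: estimate the PSD by averaging windowed periodograms, then divide the data's spectrum by the square root of that estimate. This is a simplified stand-in under my own assumptions (segment length, Hann window, linear interpolation of the PSD, synthetic coloured noise), not PyCBC's implementation; the names `welch_psd` and `whiten` here are local functions.

```python
import numpy as np

def welch_psd(x, sample_rate, seg_len):
    """One-sided PSD from averaged periodograms of half-overlapping Hann-windowed segments."""
    window = np.hanning(seg_len)
    step = seg_len // 2
    segs = [x[i:i + seg_len] * window for i in range(0, len(x) - seg_len + 1, step)]
    power = np.mean([np.abs(np.fft.rfft(s)) ** 2 for s in segs], axis=0)
    psd = 2.0 * power / (sample_rate * (window ** 2).sum())
    return np.fft.rfftfreq(seg_len, 1.0 / sample_rate), psd

def whiten(x, sample_rate, seg_len=256):
    """Flatten the spectrum of x by dividing by its estimated amplitude spectral density."""
    freqs, psd = welch_psd(x, sample_rate, seg_len)
    full_freqs = np.fft.rfftfreq(len(x), 1.0 / sample_rate)
    asd = np.sqrt(np.interp(full_freqs, freqs, psd))   # PSD interpolated to full resolution
    return np.fft.irfft(np.fft.rfft(x) / asd, n=len(x))

fs = 2048                                              # Hz, assumed sampling rate
n = 4 * fs
rng = np.random.default_rng(0)
# Synthetic coloured noise: white noise shaped so that low frequencies dominate.
shape = 1.0 / (1.0 + np.fft.rfftfreq(n, 1.0 / fs) / 20.0)
colored = np.fft.irfft(np.fft.rfft(rng.standard_normal(n)) * shape, n=n)
whitened = whiten(colored, fs)

f, psd_before = welch_psd(colored, fs, 256)
_, psd_after = welch_psd(whitened, fs, 256)
low = (f >= 20) & (f <= 100)
high = (f >= 500) & (f <= 900)
ratio_before = psd_before[low].mean() / psd_before[high].mean()
ratio_after = psd_after[low].mean() / psd_after[high].mean()
```

Before whitening, the low band carries far more power than the high band (`ratio_before`); afterwards the two are comparable (`ratio_after` is close to 1), which is the flattening described above.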

f:id:AI_ML_DL:20210711142318p:plain

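Applying a 30-250 Hz band-pass filter to the data around GW150914 removes the frequency ranges unrelated to the event.

f:id:AI_ML_DL:20210711144631p:plain

Zoom in on the region of interest.

f:id:AI_ML_DL:20210711150939p:plain

After correcting for the 7 ms time difference between the detectors, the H1 and L1 spectra agree.

f:id:AI_ML_DL:20210711151004p:plain

A Q-transform plot visualizes regions of excess in the data. It is a method commonly used to visualize gravitational-wave data, using the time-frequency representation known as CQT.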

f:id:AI_ML_DL:20210711151032p:plain

f:id:AI_ML_DL:20210711151054p:plain

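I tried CQT1992v2.

The sampling frequency it assumes seems to be in the tens of kHz. The gravitational-wave detectors sample at 2048 Hz, an order of magnitude lower. Perhaps for that reason, I could not freely change the shape or resolution of the Q-transformed image. I'll look into it a bit more.

I thought the tutorial had given me a reasonable understanding of pycbc, but when I tried to use it, settings such as whiten and qrange cannot be chosen without going into the details. Using the settings from the public notebooks yields images like those in the papers, but I would rather understand what the parameters mean first...

July 12 (Mon)

European Gravitational Observatory - EGO：236 teams, 3 months to go

Examining CQT1992v2: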

sr, hop_length, fmin, fmax, n_bins, bins_per_octave, norm, window, center, pad_mode, trainable, output_format, verbose, device

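I changed these parameters in the program and examined their effects, as far as I was able to.

Changing bins_per_octave appeared to change the resolution along the frequency axis.

Changing hop_length appeared to change the resolution along the time axis. The default of 512 seemed meaningless here.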
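
The effect of these two parameters on the output shape can be seen with a direct, slow constant-Q transform. This is a naive sketch written from the textbook definition (geometrically spaced centre frequencies, window length proportional to 1/f); it reuses the parameter names listed above but is not nnAudio's CQT1992v2 implementation, and the 128 Hz test tone is an assumption.

```python
import numpy as np

def naive_cqt(x, sr, hop_length, fmin, n_bins, bins_per_octave):
    """Direct constant-Q transform; returns centre frequencies and magnitudes (n_bins, n_frames)."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)         # constant ratio f / bandwidth
    freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    centers = np.arange(0, len(x), hop_length)                # one frame every hop_length samples
    out = np.zeros((n_bins, len(centers)))
    for k, f in enumerate(freqs):
        length = int(np.ceil(q * sr / f))                     # window shrinks as frequency rises
        n = np.arange(length) - length // 2
        kernel = np.hanning(length) * np.exp(-2j * np.pi * f * n / sr) / length
        padded = np.pad(x, (length // 2, length - length // 2))
        for j, c in enumerate(centers):
            out[k, j] = np.abs(np.dot(padded[c:c + length], kernel))
    return freqs, out

fs = 2048                                                     # Hz, assumed sampling rate
t = np.arange(2 * fs) / fs
tone = np.sin(2 * np.pi * 128 * t)                            # test tone at 128 Hz

freqs, coarse = naive_cqt(tone, fs, hop_length=64, fmin=20, n_bins=40, bins_per_octave=8)
_, fine_time = naive_cqt(tone, fs, hop_length=32, fmin=20, n_bins=40, bins_per_octave=8)
_, fine_freq = naive_cqt(tone, fs, hop_length=64, fmin=20, n_bins=80, bins_per_octave=16)
peak_hz = freqs[coarse.mean(axis=1).argmax()]                 # bin with the strongest response
```

Halving `hop_length` doubles the number of time frames, while doubling `bins_per_octave` (together with `n_bins`, to cover the same range) doubles the number of frequency rows. With only 4096 samples per signal, a hop of 512 would leave just 8 frames, which may be why the default looked meaningless.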

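I raised the apparent resolution of the frequency maps and ran EfficientNet, but it took so long that I stopped partway. It looks as though training would not finish within the time limit on Kaggle's GPU, so if higher resolution turns out to improve prediction accuracy, I will need to consider using a TPU.

Label-1 images: I looked through several dozen images in which a gravitational wave is supposedly detected, but I think fewer than 10% are ones where I can tell by eye that it is detected in all three images. The task surely cannot be that easy, but I don't know what to do.

July 13 (Tue)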

European Gravitational Observatory - EGO：247 teams, 3 months to go

スコアが上がらないな。GW150914の重力波検出に始まって数多くの重力波が検出されている。このコンペは、埋もれた信号を、データサイエンスの力でどこまで掘り出せるかを競うものであり、主催者の目的が明確で、挑戦的な内容になっているような気がしている。Q変換画像からヒトの目で検知できるものは、予測結果が1.0に集中し、そこから急激に減衰し、0.8くらいで最小値になったあと徐々に高くなって、0.2付近にブロードなピークを持つ。0.7から1.0の間の分布が少なすぎる（小さすぎる）ように思う。

（上に転記した）pycbcのチュートリアルに示されているような手順にしたがって丁寧に解析し、シミュレーションによって作成した様々な重力波の波形を用いて波形探索を行い、重力波受信領域を絞り込み、絞り込んだ領域について、さらに詳細な解析をしてはじめて明らかになるような重力波の検出プロセスを、機械学習モデルによって簡略化することが求められているのだろう。

重力波が検出されているかどうかの判定：15 Hz以下の低周波成分の除去：ローパスフィルターとハイパスフィルターの組み合わせ：3つの検出器で、ある時間差以内に、同時検出されている事：検出されている≡周波数変換画像において特定の領域の強度が高い：

EfficientNetのサイズによって予測精度はどのくらい違うのか、調べてみた。

B1からB7まで変えたとき、val_aucは、およそ、0.861から0.863まで変化した。B7の計算時間はB1の2.8倍かかった。サイズ効果はほぼ単調増大であったが、こういうのは非常に珍しい。モデルサイズだけ変えるのは容易なので、時間さえあれば、他のパラメータを固定して、計算させてみるのだが、このように、val_aucがモデルサイズに対して単調に増大するのは珍しい。imagenetやciferのようにデータ量が膨大な場合にこのようになるようだ。このコンペもデータ量は多い方だと思う。

July 15 (Thu)

Radiological Society of North America: 50 teams, 3 months to go

RSNA-MICCAI Brain Tumor Radiogenomic Classification
Predict the status of a genetic biomarker important for brain cancer treatment

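The task is to determine from multi-parametric MRI (mpMRI) images whether MGMT promoter methylation is present. The data size is large, but there are only 585 training sets. Each set consists of four types of MRI images. Can a model be trained on all four types together? Alternatively, one could train and predict on each type separately and then take a majority vote or an average.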
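The two ways of combining per-type predictions can be sketched in a few lines. The modality names and probabilities below are made-up placeholders, not values from the competition.

```python
def average_vote(probs):
    """Mean of the per-modality probabilities."""
    return sum(probs) / len(probs)

def majority_vote(probs, threshold=0.5):
    """Return 1 only if strictly more than half of the modalities vote positive."""
    votes = sum(p >= threshold for p in probs)
    return 1 if votes * 2 > len(probs) else 0

# Hypothetical per-modality predictions for one patient.
per_modality = {"FLAIR": 0.62, "T1w": 0.41, "T1wCE": 0.55, "T2w": 0.48}
probs = list(per_modality.values())

print(average_vote(probs))   # a probability, usable directly for an AUC-type metric
print(majority_vote(probs))  # a hard label; 2 of 4 votes is a tie and counts as 0 here
```

With four models a tie is possible, so the majority vote needs a tie-breaking rule, whereas the average always yields a usable probability.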

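One of the public notebooks appears to take several images at equal intervals from one series and overlay (average) them into a single 2D image. Interestingly, even that gives a reasonable score. Instead of overlaying them, tiling them side by side could also be interesting.

Rather than trying to force-fit a general-purpose model, I want to think in various ways about how to extract the features of the lesion.

July 16 (Fri)

RSNA-MICCAI Brain Tumor Radiogenomic Classification: 65 teams, 3 months to go

There is a paper about this competition.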

U.Baid, et al., “The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor Segmentation and Radiogenomic Classification”, arXiv:2107.02314, 2021.

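Two tracks are running, segmentation and classification (MGMT), and Kaggle hosts the latter.

*** Slowing down ***

August 1 (Sun)

I had been a Kaggler every day, but things began to change about two weeks ago, and from now on I expect to be a weekend Kaggler. The way I take part in competitions will probably change as well.

Rather than chasing results, I want to put effort into improving my data-preprocessing skills and building new prediction models.

... The Kaggle walk for July 2021 ends today! ...

An email from Kaggle that arrived on October 19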

Congratulations!

You received a Silver medal for 'RSNA-MICCAI Brain Tumor Radiogenomic Classification'

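I had completely forgotten about it, so I was surprised.

Looking at the leaderboard, I am indeed within the top 5%.

On the public leaderboard I was in the lower part (around the 70% mark).

Looking at my submissions, the Private score is -1.000 for all but two of them, even though their Public scores do not differ much.

One could say this is the result of taking care not to overfit, but I also feel I was simply lucky.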

f:id:AI_ML_DL:20210701010117p:plain — style=168 iteration=500

f:id:AI_ML_DL:20210701010256p:plain — style=168 iteration=50

f:id:AI_ML_DL:20210701010409p:plain — style=168 iteration=5

2021-06-09

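Kaggle Walk (June 2021)

Various competitions

June 9 (Wed)

CommonLit: 1,913 teams, 2 months to go

Rate the complexity of literary passages for grades 3-12 classroom use

The competition is about models that estimate how readable a passage is for children in grade 3-12 classrooms. It seems a hard problem for machine learning, but even before that, it seems a question on which school teachers, educators, and researchers would disagree. I have only glanced at it so far, but the Kagglers' discussion also seems heated.

An example of a prediction model:

https://huggingface.co/transformers/model_doc/roberta.html

The following is quoted from that site.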

Transformers
State-of-the-art Natural Language Processing for Jax, Pytorch and TensorFlow

RoBERTa
Overview
The RoBERTa model was proposed in

RoBERTa: A Robustly Optimized BERT Pretraining Approach

by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google’s BERT model released in 2018.

It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

The abstract from the paper is the following:

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

I haven't done much natural language processing, so I'll learn it little by little.

Coleridge Initiative: 1,447 teams, 14 days to go

Coleridge Initiative - Show US the Data, Discover how data is used for the public good

In this competition, you'll use natural language processing (NLP) to automate the discovery of how scientific data are referenced in publications. Utilizing the full text of scientific publications from numerous research areas gathered from CHORUS publisher members and other sources, you'll identify data sets that the publications' authors used in their work.

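A technique for extracting information from papers?

If it could also assess the accuracy of the extracted information, of the interpretation of experimental results, and of the derivation of equations, it might be useful as a peer-review system or as an aid to authors, but...

This is also natural language processing, so I intend to learn it little by little.

Berkeley SETI Research Center: 854 teams, 2 months to go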

SETI Breakthrough Listen - E.T. Signal Search
Find extraterrestrial signals in data from deep space

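Search for ExtraTerrestrial Intelligence (SETI)

Looks interesting.

June 10 (Thu)

Berkeley SETI Research Center: 873 teams, 2 months to go

The idea seems to be that by alternately switching the direction from which signals are received, one can infer that some signal is being emitted from a particular direction, and the source of that signal is then regarded as extraterrestrial intelligence.

The upper image is an actual signal received from Voyager, recorded while pointing toward Voyager and while pointing away from it.

The lower image is a simulation of the signal that would be detected if extraterrestrial intelligence were transmitting.

It doesn't look like such a difficult task.

Coleridge Initiative: 1,478 teams, 12 days to go

I copied a public notebook and submitted it. From here I want to study the code and work toward a better score.

It may be unrelated, but searching for Coleridge Initiative turned up this paper.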

Harvard Data Science Review • Issue 3.2, Spring 2021
Enhancing and Accelerating Social Science Via Automation: Challenges and Opportunities

Tal Yarkoni, Dean Eckles, James A. J. Heathers, Margaret C. Levenstein, Paul E. Smaldino, Julia Lane Published on: Apr 30, 2021 DOI: 10.1162/99608f92.df2262f5

June 11 (Fri)

Coleridge Initiative: 1,483 teams, 12 days to go

The objective of the competition is to identify the mention of datasets within scientific publications. Your predictions will be short excerpts from the publications that appear to note a dataset.

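The task is to identify the datasets that a publication appears to reference and report them as short excerpts. If there are several datasets, they are separated with a delimiter.

Learning from the Transformers library

The library is organized around tasks like the following. How could one extract the datasets used in a paper?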

・Sentiment analysis: is a text positive or negative?

・Text generation (in English): provide a prompt and the model will generate what follows.

・Name entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.)

・Question answering: provide the model with some context and a question, extract the answer from the context.

・Filling masked text: given a text with masked words (e.g., replaced by [MASK]), fill the blanks.

・Summarization: generate a summary of a long text.

・Translation: translate a text in another language.

・Feature extraction: return a tensor representation of the text.

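Generate a summary and check whether it contains the dataset information.

Ask "Which datasets are used?" and have the model answer.

The answer may depend on how the question is phrased, but it seems worth trying.

Have the model fill in the blank in a sentence such as "This paper was written with reference to *****."

Use named entity recognition (which I don't understand yet) to label the entity represented by a cluster of words (a phrase) that is longer than a word but shorter than a sentence, and pull out the ones that correspond to datasets.

Extract phrases similar to dataset_title.

Tomorrow I will continue studying the Transformers library, look for public notebooks that use it, and, if there are any, learn from them while looking for ways to raise the score.

June 12 (Sat)

Coleridge Initiative: 1,495 teams, 11 days to go

The BERT paper: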

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, Google AI Language arXiv:1810.04805v2 [cs.CL] 24 May 2019

Abstract
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

トランスフォーマーからの双方向エンコーダー表現を表すBERTと呼ばれる新しい言語表現モデルを紹介します。最近の言語表現モデル（Peters et al。、2018a; Radford et al。、2018）とは異なり、BERTは、すべてのレイヤーで左右両方のコンテキストを共同で調整することにより、ラベルのないテキストから深い双方向表現を事前トレーニングするように設計されています。その結果、事前にトレーニングされたBERTモデルを1つの追加出力レイヤーで微調整して、質問応答や言語推論などの幅広いタスク用の最先端のモデルを作成できます。タスク固有のアーキテクチャを大幅に変更する必要はありません。by Googlr翻訳

A.3 Fine-tuning Procedure
For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability was always kept at 0.1. The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks:
• Batch size: 16, 32
• Learning rate (Adam): 5e-5, 3e-5, 2e-5
• Number of epochs: 2, 3, 4

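These fine-tuning settings caught my eye, and I reproduce them here, because in the ongoing CommonLit competition the hyperparameters used to train a transformer model (model = AutoModelForSequenceClassification.from_pretrained) mostly fall within these ranges, and training has gone badly when I strayed from them.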
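The recommended ranges from A.3 are small enough to enumerate in full. A minimal sketch, with dictionary keys that are my own naming rather than any library's arguments:

```python
from itertools import product

# Value ranges quoted from A.3 of the BERT paper.
batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
epoch_counts = [2, 3, 4]

# One dictionary per combination; the key names are illustrative only.
grid = [
    {"batch_size": b, "learning_rate": lr, "num_epochs": e}
    for b, lr, e in product(batch_sizes, learning_rates, epoch_counts)
]

print(len(grid))  # number of configurations to try
print(grid[0])
```

That gives 18 configurations in total, few enough to try exhaustively if a single fine-tuning run is short.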

transformers v.4.6.0:

Sequence Classification:

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
>>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

Extractive Question Answering:

>>> tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
>>> model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

Language Modeling:

Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling, GPT-2 with causal language modeling.

Language modeling can be useful outside of pretraining as well, for example to shift the model distribution to be domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset or on scientific papers e.g. LysandreJik/arxiv-nlp.

Masked Language Modeling:

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
>>> model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")

Causal Language Modeling:

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelWithLMHead.from_pretrained("gpt2")

Text Generation:

>>> model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased")
>>> tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

Named Entity Recognition:

>>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Summarization:

Summarization is the task of summarizing a document or an article into a shorter text. If you would like to fine-tune a model on a summarization task, you may leverage the run_summarization.py script.

>>> model = AutoModelWithLMHead.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")

Translation:

Translation is the task of translating a text from one language to another. If you would like to fine-tune a model on a translation task, you may leverage the run_translation.py script.

>>> model = AutoModelWithLMHead.from_pretrained("t5-base")
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")

>>> inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
>>> outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)

>>> print(tokenizer.decode(outputs[0]))
Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.

It would be nice to write the code from scratch, but that isn't realistic, so I'll look for public notebooks. I can't find anything that looks good.

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer, Facebook AI arXiv:1910.13461v1 [cs.CL] 29 Oct 2019

I thought it was a misspelling of BERT, but BART turns out to be a separate thing, which surprised me; that is how unfamiliar I am with this field.

In this paper, we present BART, which pre-trains a model combining Bidirectional and Auto-Regressive Transformers. BART is a denoising autoencoder built with a sequence-to-sequence model that is applicable to a very wide range of end tasks. Pretraining has two stages (1) text is corrupted with an arbitrary noising function, and (2) a sequence-to-sequence model is learned to reconstruct the original text. BART uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes (see Figure 1).

この論文では、双方向トランスフォーマーと自己回帰トランスフォーマーを組み合わせたモデルを事前トレーニングするBARTを紹介します。 BARTは、シーケンス間モデルで構築されたノイズ除去オートエンコーダです。
非常に幅広いエンドタスクに適用できます。事前トレーニングには2つの段階があります（1）テキストは任意のノイズ処理機能で破損し、（2）シーケンス間モデルは元のテキストを再構築するために学習されます。 BARTは、標準のTranformerベースのニューラル機械翻訳アーキテクチャを使用します。これは、その単純さにもかかわらず、BERT（双方向エンコーダーによる）、GPT（左から右へのデコーダーを使用）、およびその他の多くの最近の事前トレーニングスキームを一般化したものと見なすことができます。（図1を参照）by Google翻訳

f:id:AI_ML_DL:20210612202405p:plain

For now, roughly this: GPT --> BERT --> BART

June 13 (Sun)

CommonLit: 2,031 teams, 2 months to go

The CommonLit competition asks for a numerical readability score for a text passage (aimed at grades 3 to 12); predictions are evaluated by RMSE.
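RMSE itself is simple to compute by hand. The target and prediction values below are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical readability targets and model predictions.
targets = [-0.5, 0.0, 1.0]
preds = [-0.1, 0.3, 0.5]
print(rmse(targets, preds))
```

Because the errors are squared before averaging, a single badly mispredicted passage costs more than several small misses.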

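Models include bert, roberta-base, roberta-large, and so on. They overfit easily, and the optimal number of epochs is small. Tuning by epoch count is too coarse, and some people seem to evaluate by step count instead.

At the moment, borrowing a public notebook puts you in the silver-to-bronze medal range, but if I simply use the current public notebooks as they are, I will surely be pushed out by the final day, August 2. Before then I need to learn from the public code and find ways to raise the score.

Coleridge Initiative: 1,514 teams, 10 days to go

The public notebook I am looking at has a section called literal matching. Does it simply mean matching strings character by character?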
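My guess at what literal matching means, as a minimal sketch: normalize the text, then look for known dataset titles as exact substrings. The titles, the sample sentence, the normalization rule, and the "|" delimiter are all assumptions for illustration, not taken from the notebook.

```python
import re

def normalize(text):
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def literal_match(document, known_titles, delimiter="|"):
    """Return the known titles that occur verbatim (after normalization) in the document."""
    doc = f" {normalize(document)} "
    found = [normalize(t) for t in known_titles if f" {normalize(t)} " in doc]
    return delimiter.join(sorted(found))

# Hypothetical dataset titles and a hypothetical sentence from a paper.
titles = ["National Example Survey", "Sample Cohort Study", "Unused Panel Data"]
sentence = "We analysed data from the Sample Cohort Study and the national example survey (2015)."
print(literal_match(sentence, titles))
```

An approach like this can only find titles that are already in the list, so it says nothing about datasets that never appeared in the training labels.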

June 14 (Mon)

Google: 443 teams, 2 months to go

Google Smartphone Decimeter Challenge
Improve high precision GNSS positioning and navigation accuracy on smartphones

Global Navigation Satellite System (GNSS) provides raw signals, which the GPS chipset uses to compute a position. Current mobile phones only offer 3-5 meters of positioning accuracy. While useful in many cases, it can create a “jumpy” experience. For many use cases the results are not fine nor stable enough to be reliable.

グローバルナビゲーション衛星システム（GNSS）は、GPS チップセットが位置を計算するために使用する生の信号を提供します。現在の携帯電話は、3〜5メートルの測位精度しか提供していません。多くの場合便利ですが、「ジャンピー」な体験を生み出すことができます。多くのユースケースでは、結果は良好ではなく、信頼できるほど安定していません。by Google翻訳

In this competition, you'll use data collected from the host team’s own Android phones to compute location down to decimeter or even centimeter resolution, if possible. You'll have access to precise ground truth, raw GPS measurements, and assistance data from nearby GPS stations, in order to train and test your submissions.

このコンテストでは、ホストチームのAndroid スマートフォンから収集したデータを使用して、可能であればデシメートルまたはセンチメートルの解像度まで位置を計算します。提出物をトレーニングおよびテストするために、正確なグラウンドトゥルース、生のGPS測定値、および近くのGPSステーションからの支援データにアクセスできます。by Google翻訳

Fast Kalman filters in Python leveraging single-instruction multiple-data vectorization. That is, running n similar Kalman filters on n independent series of observations. Not to be confused with SIMD processor instructions.

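The Kalman filter is used to estimate quantities that change from moment to moment (for example, the position and velocity of an object) from discrete observations that contain errors. It is widely used in engineering, for example in radar and computer vision. In car navigation, for instance, it is applied to estimate the constantly changing position of a vehicle by fusing error-prone information from the built-in accelerometer and from satellites. By exploiting the laws that govern how the target evolves over time, the Kalman filter can estimate the target's position at the present (filtering), in the future (prediction), and in the past (interpolation or smoothing). by Wikipedia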
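To make the predict/update cycle concrete, here is a minimal one-dimensional sketch in plain Python. It does not use the simdkalman library quoted above; the random-walk state model, the variance values, and the sample observations are assumptions chosen for illustration.

```python
def kalman_1d(observations, process_var, obs_var, x0=0.0, p0=1.0):
    """Filter a noisy series, assuming the hidden value follows a random walk."""
    x, p = x0, p0  # state estimate and its variance
    estimates = []
    for z in observations:
        # Predict: the state is carried over, its uncertainty grows.
        p = p + process_var
        # Update: blend prediction and observation using the Kalman gain.
        k = p / (p + obs_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates

# Hypothetical noisy readings of a quantity that is really close to 1.0.
noisy = [1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.05]
smooth = kalman_1d(noisy, process_var=1e-4, obs_var=0.05, x0=noisy[0])
print([round(v, 3) for v in smooth])
```

A small process variance relative to the observation variance tells the filter to trust its own prediction more than each new reading, which is what smooths out the jumps.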

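It looks like it will take me days to understand this.

I borrowed a public notebook written for TPU and ran it. Its published score is LB=6.752, so I ran it as is and confirmed that I got a similar score.

Next I ran it on GPU. It is written in TensorFlow/Keras, so there is little TPU-specific code and it is easy to change. But the speed is different. Since it assumes a TPU, it uses 5 folds with a maximum of 100 epochs, which is hopeless on a GPU. As a test I set it to 3 epochs and got LB=0.676. When I then increased it to 6 epochs, the NN should have trained further, yet the result was LB=0.701. I'll move forward while working to understand the code.

June 15 (Tue)

Berkeley SETI Research Center: 942 teams, a month to go

Information is being exchanged in a thread titled "Image Size vs Score". Higher image resolution tends to give higher scores. Since this is not a code competition, differences in available compute show up easily. In "What's your best single model?" image size is again the topic: large images give good scores but won't run in a Kaggle kernel. In the end the talk turns to ensembles.

Coleridge Initiative: 1,521 teams, 7 days to go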

このコンテストでは、自然言語処理（NLP）を使用して、出版物で科学データがどのように参照されているかを自動的に検出します。CHORUS出版社のメンバーやその他の情報源から収集された多数の研究分野からの科学出版物の全文を利用して、出版物の著者が彼らの仕事で使用したデータセットを特定します。by Google翻訳

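I read through the full text of one train_data / train_label pair to see where and how the label appears in the text.

I read the test_data and checked whether the public notebook's predictions correspond to the datasets being sought. ANDI does not seem to be a dataset. The BERT-MLM prediction was Lothian Birth Cohort Study, which is a research outcome but does not seem to be something one would call a dataset either. Difficult.

June 16 (Wed)

Major League Baseball: 148 teams, 2 months to go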

MLB Player Digital Engagement Forecasting
Predict fan engagement with baseball player digital content

I don't understand what engagement means.

In this competition, you’ll predict how fans engage with MLB players’ digital content on a daily basis for a future date range. You’ll have access to player performance data, social media data, and team factors like market size. Successful models will provide new insights into what signals most strongly correlate with and influence engagement.

このコンテストでは、ファンがMLBプレーヤーのデジタルコンテンツを将来の日付範囲で毎日どのように利用するかを予測します。プレーヤーのパフォーマンスデータ、ソーシャルメディアデータ、市場規模などのチーム要因にアクセスできます。成功したモデルは、どのシグナルがエンゲージメントと最も強く相関し、影響を与えるかについての新しい洞察を提供します。by Google翻訳

I still don't really understand what engage and engagement mean.

June 17 (Thu)

Google Smartphone Decimeter Challenge: 474 teams, 2 months to go

Our team is currently using only post processing to improve the accuracy. We have found that the order of post processing changes the accuracy significantly, so we share the results.

A discussion is under way on post-processing steps and their effect.

June 18 (Fri)

Berkeley SETI Research Center: 1,001 teams, a month to go

There are just a few of us data scientists at Kaggle launching about 50 competitions a year with many different data types over a very wide range of domains. Worrying about leakage and other failure points keeps us up at night. We absolutely value our community's time and effort and know how important it is to have fun and challenging competitions.

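Unfortunately, there seems to have been a leak. The Kaggle staff work day and night, with few people, to keep this kind of trouble from happening!

They will probably be removed soon, but the LB currently shows about 45 entries at 1.000 and about 75 at 0.999. Had the timing been right, I would probably have tried to join them. A pity.

There are Kagglers who notice a leak early, make it public, and work with Kaggle staff to track down the cause. People like that are probably what keeps Kaggle going.

June 19 (Sat)

Major League Baseball: 194 teams, a month to go

This competition seems quite different from ordinary ones. That is most evident in the Timeline, which is split into a Training Timeline and an Evaluation Timeline.

The Training Timeline states that the Training Set will be updated about 1.5 weeks before the Final Submission Deadline.

The Evaluation Timeline gives the following explanation.

After the final submission deadline, the leaderboard will be updated periodically to reflect future date ranges in this competition's evaluation period. At that point, each team's selected notebooks will be rerun on that future data (the competition closes on September 15). These reruns will include training data updated through July 31, 2021. by Google Translate

So the Training Set will be updated around July 20, and after the submission deadline the scores will keep changing with the updated data until September 15, which means the current scores are almost meaningless.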

June 20 (Sun)

Major League Baseball: 228 teams, a month to go

As this competition is brought to you in collaboration with the launch of Vertex AI, we're providing GCP coupons for users to try out some of the great, powerful new resources made available through Vertex AI. This includes JupyterLab Notebooks, Explainable AI, hyperparameter tuning through Vizier, and countless other AI training and deployment tools.

このコンテストはVertexAIのリリースと共同で開催されるため、ユーザーがVertexAIを通じて利用できる優れた強力な新しいリソースのいくつかを試すためのGCPクーポンを提供しています。これには、JupyterLab Notebooks、Explainable AI、Vizierによるハイパーパラメータ調整、およびその他の無数のAIトレーニングおよび展開ツールが含まれます。by Google翻訳

Vertex AI

Vertex AI brings AutoML and AI Platform together into a unified API, client library, and user interface. AutoML allows you to train models on image, tabular, text, and video datasets without writing code, while training in AI Platform lets you run custom training code. With Vertex AI, both AutoML training and custom training are available options. Whichever option you choose for training, you can save models, deploy models and request predictions with Vertex AI.

Vertex AIは、AutoMLとAIプラットフォームを統合されたAPI、クライアントライブラリ、およびユーザーインターフェイスに統合します。 AutoMLを使用すると、コードを記述せずに画像、表、テキスト、およびビデオのデータセットでモデルをトレーニングできます。AIプラットフォームでのトレーニングでは、カスタムトレーニングコードを実行できます。 Vertex AIでは、AutoMLトレーニングとカスタムトレーニングの両方が利用可能なオプションです。トレーニングにどのオプションを選択しても、Vertex AIを使用してモデルを保存し、モデルをデプロイし、予測をリクエストできます。by Google翻訳

JupyterLab is a next-generation web-based user interface for Project Jupyter.

Notebooks enables you to create and manage virtual machine (VM) instances that are pre-packaged with JupyterLab.

Notebooks instances have a pre-installed suite of deep learning packages, including support for the TensorFlow and PyTorch frameworks. You can configure either CPU-only or GPU-enabled instances, to best suit your needs.

Your Notebooks instances are protected by Google Cloud authentication and authorization, and are available using a Notebooks instance URL. Notebooks instances also integrate with GitHub so that you can easily sync your notebook with a GitHub repository.

Notebooks saves you the difficulty of creating and configuring a Deep Learning virtual machine by providing verified, optimized, and tested images for your chosen framework.

Introduction to Vertex Explainable AI for Vertex AI

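June 23 (Wed)

Coleridge Initiative: 1,610 teams, 5 hours ago

It ended this morning. I was provisionally 420th. Among my submissions were some that would have ranked 170th to 177th, but that model's public_LB was poor and I could not judge its method to be better than the others, so I could not pick it as one of my two final submissions. The medal zone was 161st and above, so it was a near miss, but I did not understand that model in detail, and its slightly better private_LB was mere chance. The runs I did late at night (early morning) with newly added datasets overfit to those datasets and all failed. This is only speculation, but once someone found that using particular datasets gave a very high public_LB score, many teams, myself included, may have paid too much attention to fitting those datasets rather than to what natural language processing is really for.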

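June 24 (Thu)

CommonLit: 2,402 teams, a month to go

Tuning only.

Google: 519 teams, a month to go

Tuning only.

* Tuning alone does not bring progress. In the Discussion, someone offered the following advice: by all means copy public code, but rather than just copying, running, and submitting it, find a model that matches your own ability and raise your score by improving the model from there.

June 25 (Fri)

CommonLit: 2,451 teams, a month to go (50th place through tuning)

Discussion: a thread asks what Readability is.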

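First, before reading the Discussion, let me pull out what is already in my head. A readable text is an easy-to-understand text. An easy-to-understand text uses easy words. It is short. Its syntax is simple. It is simple. It contains a subject and a predicate. Its logic is simple. It is concrete. It is easy to comprehend. It is direct. Readable. Easy to understand. Easy to comprehend. Simple. Logical. Simple in its logic. A readable text contains only words and things the reader knows. It is about things the reader has learned, knows, and remembers.

Let me read the Discussion. The trouble is that English has poor readability for me. The problem is not that it is English; rather, I know too little about readability metrics and about the issues and problems of the evaluation methods, so I either cannot understand or take a long time to understand. You cannot read a text without knowledge of what it describes. Domain knowledge, I suppose it is called, is needed, but nowadays Wikipedia and Google Scholar make that easier. Is it fair to say that a readable text contains no domain knowledge and only general knowledge? General knowledge also comes in levels. What is being asked here may be exactly that: judging the level of general knowledge. Wikipedia carries a note not to confuse it with common knowledge. Domain knowledge, general knowledge, and common knowledge seem to overlap in part. I have wandered away from the Discussion.

Back to the Discussion. Readability is defined by the domain knowledge accumulated so far, and the rating scale is also fixed. All that is being asked now is to have a model learn the train_data and reproduce it faithfully. The hope, I imagine, is to reduce the variance and raise the precision of readability assessment, but I doubt that doing so would raise the score. The train_data set is everything, I think. That is not what is really wanted, but for now anything more is only an illusion.

Looking back from 100 years in the future, 2021 will probably be seen as a time when we were struggling with toy problems. Or perhaps the very belief that humans possess superior intelligence is an illusion, and what humans do (what we think of as intellectual activity) is not only already being faithfully reproduced by ANNs, but ANNs are about to surpass the human level. What will humanity have achieved, and what will it be seeking, 100 or 1000 years from now?

June 29 (Tue)

Optiver: 31 teams, 3 months to go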

Optiver Realized Volatility Prediction
Apply your data science skills to make financial markets better

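Overview (machine translation)

In the first three months of this competition, you will build models that predict short-term volatility for hundreds of stocks across different sectors. You will have hundreds of millions of rows of highly granular financial data at your fingertips, with which you will design models that forecast volatility over 10-minute periods. Your models will be evaluated against real market data collected in the three-month evaluation period after training.

Volatility (machine translation)

Volatility is one of the most prominent terms you will hear on any trading floor, and for good reason. In financial markets, volatility captures the amount of fluctuation in prices. High volatility is associated with periods of market turbulence and large price swings, while low volatility describes calmer, quieter markets. For trading firms like Optiver, accurately predicting volatility is essential for the trading of options, whose price is directly related to the volatility of the underlying product.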
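One common way to put a number on "the amount of fluctuation in prices" is realized volatility, the square root of the sum of squared log returns. The sketch below uses that textbook definition with made-up price series; whether the competition computes its target in exactly this way is an assumption on my part.

```python
import math

def log_returns(prices):
    """Log return between each pair of consecutive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def realized_volatility(prices):
    """Square root of the sum of squared log returns over the window."""
    return math.sqrt(sum(r * r for r in log_returns(prices)))

# Hypothetical price paths over one window: a quiet market and a turbulent one.
calm = [100.0, 100.1, 100.0, 100.1, 100.0]
wild = [100.0, 102.0, 99.0, 103.0, 100.0]

print(realized_volatility(calm))
print(realized_volatility(wild))
```

Both paths start and end at the same price, yet the second has a far larger realized volatility, which is the sense in which volatility measures movement rather than direction.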

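I had been avoiding finance competitions, but you gain nothing without taking part, whereas taking part brings more chances to pick up a rough understanding of the domain knowledge, the train_data, the prediction models, and so on, so I decided to add this one to my walking route.

I tried a public notebook that uses XGBRegressor.

Maybe I'll try lightgbm.LGBMRegressor.

June 30 (Wed)

Optiver: 147 teams, 3 months to go

It has only just begun. With only 147 teams taking part, I made it into the top 50!

Switching from XGB to LGBM improved the score, but I think it was just chance. From here I plan to tune for the task.

Tasks where generalizing matters, tasks where specializing matters, tasks where accuracy matters, ...

These are mixed together, and their relative weight differs from task to task.

In most diagnostic-imaging tasks it is important not to miss abnormalities, so I feel like taking them seriously, but some tasks make serious effort seem silly. When competing on score, I sometimes forget the original purpose. Even when I find a task rewarding, I can lose interest if the score doesn't improve. And even with a topic I don't care about, when the score goes up I throw myself into it and lose track of time. Like a child absorbed in a game. Perhaps I've caught Kaggle fever.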

f:id:AI_ML_DL:20210609104123p:plain — style=167 iteration=500

2021-05-31

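Kaggle Walk (June 1 to August 10, 2021: SIIM-FISABIO-RSNA COVID-19 Detection)

June 1 to August 10, 2021:

SIIM-FISABIO-RSNA COVID-19 Detection

Identify and localize COVID-19 abnormalities on chest radiographs

Starting today, I'll work on this competition for two months and ten days.

For this competition I will concentrate on reading and learning from all of the code and discussion within it.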

In this competition, you’ll identify and localize COVID-19 abnormalities on chest radiographs.

In particular, you'll categorize the radiographs as negative for pneumonia or typical, indeterminate, or atypical for COVID-19.

You and your model will work with imaging data and annotations from a group of radiologists.

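June 1 (Tue)

Society for Imaging Informatics in Medicine (SIIM): 281 teams, 2 months to go

Overview: infer the condition of COVID-19 patients from chest radiographs.

discussion:

There are posts introducing top solutions from similar past competitions.

The data are about 120 GB. The images are too large to handle as they are, so someone has downsized them to 1000 x 1000 or smaller and published them as a dataset. Much appreciated. The code used to build the dataset is public too, so preparing it myself with that as a reference would also be good practice.

A GM with a Keroppi avatar announced a starter kit. They did the same in the BMS competition, publishing code good enough for the top 100, while their own team was at a high level within the top 20.

Searched for the best-scoring code, looking for notebooks whose training code is also public. At this point, just learning.

Sorted by best score, picked a notebook with a low score but correct code, and by copying, running, committing, and submitting it, I appeared near the bottom of the leaderboard. LB=0.005

June 2 (Wed)

Society for Imaging Informatics in Medicine (SIIM): 291 teams, 2 months to go

Evaluation:

The metric is the standard PASCAL VOC 2010 mean Average Precision (mAP) at IoU > 0.5.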
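The IoU part of the metric is easy to check by hand. A minimal sketch for boxes in xmin ymin xmax ymax form, with made-up coordinates; the full mAP computation (ranking by confidence, precision-recall integration) is not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clipped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

truth = (100, 100, 200, 200)
shifted = (150, 100, 250, 200)  # overlaps half of the ground-truth box
tight = (110, 110, 200, 200)    # slightly smaller box inside the ground truth

for name, box in (("shifted", shifted), ("tight", tight)):
    score = iou(truth, box)
    print(name, round(score, 3), "match" if score > 0.5 else "no match")
```

A box shifted by half its width already drops to an IoU of one third, so it would not count as a hit at the 0.5 threshold.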

In this competition, we are making predictions at both a study (multi-image) and image level.

There are Study-level labels and Image-level labels.

There are two files: train_study_level.csv and train_image_level.csv.

Study-level labels：

Studies in the test set may contain more than one label. They are as follows:

"negative", "typical", "indeterminate", "atypical"

For each study in the test set, you should predict at least one of the above labels.

The format for a given label's prediction would be a class ID from the above list, a confidence score, and 0 0 1 1 is a one-pixel bounding box.

Example Submission File rows for Study-level labels

Id, PredictionString

2b95d54e4be65_study, negative 1 0 0 1 1
2b95d54e4be66_study, typical 1 0 0 1 1
2b95d54e4be67_study, indeterminate 1 0 0 1 1 atypical 1 0 0 1 1

Image-level labels：

Images in the test set may contain more than one object.

For each object in a given test image, you must predict a class ID of "opacity", a confidence score, and bounding box in format xmin ymin xmax ymax.

If you predict that there are NO objects in a given image, you should predict none 1.0 0 0 1 1, where none is the class ID for "No finding", 1.0 is the confidence, and 0 0 1 1 is a one-pixel bounding box.

Example Submission File rows for Image-level labels

Id, PredictionString

2b95d54e4be68_image, none 1 0 0 1 1
2b95d54e4be69_image, opacity 0.5 100 100 200 200 opacity 0.7 10 10 20 20
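The PredictionString format described above can be assembled with two small helpers. The function names are my own, and the number formatting (`:g`, which drops trailing zeros and keeps six significant digits) is an assumption about what the scorer accepts.

```python
def study_prediction(labels):
    """Build a study-level PredictionString from (label, confidence) pairs."""
    # Every study-level label is followed by the fixed one-pixel box 0 0 1 1.
    return " ".join(f"{label} {conf:g} 0 0 1 1" for label, conf in labels)

def image_prediction(boxes):
    """Build an image-level PredictionString from (conf, xmin, ymin, xmax, ymax) tuples."""
    if not boxes:
        # No objects found: the "none" class with the one-pixel box.
        return "none 1 0 0 1 1"
    return " ".join(
        f"opacity {conf:g} {x1:g} {y1:g} {x2:g} {y2:g}"
        for conf, x1, y1, x2, y2 in boxes
    )

print(study_prediction([("indeterminate", 1), ("atypical", 1)]))
print(image_prediction([(0.5, 100, 100, 200, 200), (0.7, 10, 10, 20, 20)]))
print(image_prediction([]))
```

These calls reproduce the example rows shown above for the study with two labels, the image with two opacities, and the image with no finding.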

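As shown above, there are two levels: study level ("negative", "typical", "indeterminate", "atypical") and image level ("none", "opacity").

I tried tuning a public (training) notebook (EfficientNetB7, runs on TPU).

I tried adding Dropout, adding a fully connected layer, changing the EfficientNet size, and changing the batch size and the initial learning rate.

Today I did not get anything better than the original.

I tried a GPU instead of the TPU. EfficientNetB7 would not fit, so I switched to B4. Images of 600 pixels also run out of memory. Just to get it running I went down to 224 pixels, which fits comfortably, but it is very slow.

It was so slow that I switched to B0 and ran all 5 folds. I expected it to be no match for B7 on 600-pixel images, but when I packaged the trained models as a dataset, added them to the inference notebook, and submitted the predictions, the score turned out comparatively close.

Original: B7 & 600 pixel & TPU --> LB=0.383

Modified: B0 & 224 pixel & GPU --> LB=0.373

A good model gives a decent score however you treat it, but I am surprised that the LB score did not drop much even after downgrading from B7 to B0 and from 600 pixels to 224 pixels.

Problems with the competition:

This competition combines two levels, study level ("negative", "typical", "indeterminate", "atypical") and image level ("none", "opacity"), that is, classifying the condition and detecting where it appears, but there seems to be a scoring error for the image-level predictions. (Someone reports that adding and removing image-level predictions and committing showed that their contribution was only LB=0.001 or LB=0.051.) Assuming the two levels contribute equally, the LB score should apparently be around 0.8+, double the current 0.4+ level.

So the current leaderboard is essentially a study-level-only evaluation. According to Julia Elliott of Kaggle Staff, this should be fixed next week.

That may be why there are still so few teams.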

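June 3 (Thu)

Society for Imaging Informatics in Medicine (SIIM): 312 teams, 2 months to go

Using public code: B0 & 224 pixel & GPU --> LB=0.373

This code is for study level (classification) only, so pushing it further may be pointless, but image level (detection) uses a different kind of model (can EfficientDet do both?), so for now I'll concentrate on the study-level task.

With 224 pixel & GPU I changed Lr, changed factor (the learning-rate decay factor), and also tried B1 and B2 instead of B0, but partly because I stopped some runs early when their intermediate results looked poor, nothing beat LB=0.373 today.

In HuBMAP I used CyclicLR and OneCycleLR hoping for Super-Convergence, so I wanted to use them here too, but the code I am using now is TensorFlow/Keras, which has no such modules.

Searching for Keras turned up nothing similar, but TensorFlow has "Useful extra functionality for TensorFlow maintained by SIG-addons", and it contains an equivalent module.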

Module: tfa.optimizers

class AdamW: Optimizer that implements the Adam algorithm with weight decay.

class AveragedOptimizerWrapper: Base class for Keras optimizers.

class COCOB: Optimizer that implements COCOB Backprop Algorithm

class ConditionalGradient: Optimizer that implements the Conditional Gradient optimization.

class CyclicalLearningRate: A LearningRateSchedule that uses cyclical schedule.

class DecoupledWeightDecayExtension: This class allows to extend optimizers with decoupled weight decay.

class ExponentialCyclicalLearningRate: A LearningRateSchedule that uses cyclical schedule.

class LAMB: Optimizer that implements the Layer-wise Adaptive Moments (LAMB).

class LazyAdam: Variant of the Adam optimizer that handles sparse updates more

class Lookahead: This class allows to extend optimizers with the lookahead mechanism.

class MovingAverage: Optimizer that computes a moving average of the variables.

class MultiOptimizer: Multi Optimizer Wrapper for Discriminative Layer Training.

class NovoGrad: Optimizer that implements NovoGrad.

class ProximalAdagrad: Optimizer that implements the Proximal Adagrad algorithm.

class RectifiedAdam: Variant of the Adam optimizer whose adaptive learning rate is rectified

class SGDW: Optimizer that implements the Momentum algorithm with weight_decay.

class SWA: This class extends optimizers with Stochastic Weight Averaging (SWA).

class Triangular2CyclicalLearningRate: A LearningRateSchedule that uses cyclical schedule.

class TriangularCyclicalLearningRate: A LearningRateSchedule that uses cyclical schedule.

class Yogi: Optimizer that implements the Yogi algorithm in Keras.

Tomorrow I'll try CyclicalLearningRate.

tfa.optimizers.CyclicalLearningRate(
initial_learning_rate: Union[FloatTensorLike, Callable],
maximal_learning_rate: Union[FloatTensorLike, Callable],
step_size: tfa.types.FloatTensorLike,
scale_fn: Callable,
scale_mode: str = 'cycle',
name: str = 'CyclicalLearningRate'
)

An example with concrete values filled in, assuming SGD (0.1-2.0):

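import tensorflow_addons as tfa

# Cyclical schedule between 0.1 and 2.0; one half-cycle lasts step_size steps.
clr = tfa.optimizers.CyclicalLearningRate(
    initial_learning_rate=0.1,
    maximal_learning_rate=2.0,
    step_size=10,
    scale_fn=lambda x: 1.0,  # constant amplitude in every cycle
    scale_mode='cycle',
    name='CyclicalLearningRate',
)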
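To see what values such a schedule produces without installing anything, here is a plain-Python sketch of the triangular policy from the cyclical learning rate paper. I believe the tfa schedule follows the same formula, but that is an assumption; the function name `cyclical_lr` is my own.

```python
import math

def cyclical_lr(step, initial_lr, maximal_lr, step_size, scale_fn=lambda cycle: 1.0):
    """Triangular cyclical learning rate at a given training step."""
    # Cycle index, starting at 1; one full cycle spans 2 * step_size steps.
    cycle = math.floor(1 + step / (2 * step_size))
    # x runs 1 -> 0 -> 1 within each cycle.
    x = abs(step / step_size - 2 * cycle + 1)
    return initial_lr + (maximal_lr - initial_lr) * max(0.0, 1 - x) * scale_fn(cycle)

for step in (0, 5, 10, 15, 20):
    print(step, round(cyclical_lr(step, 0.1, 2.0, step_size=10), 3))
```

With these values the rate climbs linearly from 0.1 to 2.0 over the first 10 steps and falls back to 0.1 over the next 10, so in practice step_size would be set to a few epochs' worth of batches rather than 10 steps.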

＜雑談＞AI, Machin Learning, Deep Learning, Data Science, Engineer or Scientist or Programmer, Kaggler：これらの単語からイメージされる領域で、自分が目指す方向を表現するのに適しているのは何か。AIは、漠然としていてつかみどころがない。Data Scienceは、自分の中では前処理のイメージが強い。Wikipediaで調べてみよう。

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.

Data Science, it seems, can be understood as cutting across not only Machine Learning and Deep Learning but every field of science and technology.

Data mining is a term I am less familiar with, so let me look that up on Wikipedia too.

Data mining is a process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.[1] Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use.

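This may be my own arbitrary reading, but let me take data mining + big data = data science, and tentatively call my current and near-future specialty Data Scientist.

To find recent topics, I searched Google Scholar with the keyword data science.

This is getting rather long for an aside, but the following looks interesting, so I introduce it here.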

AutoDS: Towards Human-Centered Automation of Data Science

Dakuo Wang et al., arXiv:2101.05273v1 [cs.HC] 13 Jan 2021

Abstract

Data science (DS) projects often follow a lifecycle that consists of laborious tasks for data scientists and domain experts (e.g., data exploration, model training, etc.). Only till recently, machine learning(ML) researchers have developed promising automation techniques to aid data workers in these tasks. This paper introduces AutoDS, an automated machine learning (AutoML) system that aims to leverage the latest ML automation techniques to support data science projects. Data workers only need to upload their dataset, then the system can automatically suggest ML configurations, preprocess data, select algorithm, and train the model. These suggestions are presented to the user via a web-based graphical user interface and a notebook-based programming user interface. We studied AutoDS with 30 professional data scientists, where one group used AutoDS, and the other did not, to complete a data science project. As expected, AutoDS improves productivity; Yet surprisingly, we find that the models produced by the AutoDS group have higher quality and less errors, but lower human confidence scores. We reflect on the findings by presenting design implications for incorporating automation techniques into human work in the data science lifecycle.

データサイエンス (DS) プロジェクトは、多くの場合、データサイエンティストと分野の専門家 (データ探索、モデルトレーニングなど) の骨の折れるタスクで構成されるライフサイクルに従います。最近まで、機械学習 (ML) の研究者は、これらのタスクでデータワーカーを支援する有望な自動化手法を開発してきました。このホワイトペーパーでは、最新の ML 自動化手法を活用してデータサイエンスプロジェクトをサポートすることを目的とした自動機械学習 (AutoML) システムである AutoDS を紹介します。データワーカーはデータセットをアップロードするだけで、システムは ML 構成を自動的に提案し、データを前処理し、アルゴリズムを選択し、モデルをトレーニングできます。これらの提案は、Web ベースのグラフィカルユーザーインターフェイスとノートブックベースのプログラミングユーザーインターフェイスを介してユーザーに表示されます。私たちは 30 人のプロのデータサイエンティストと共に AutoDS を研究しました。一方のグループは AutoDS を使用し、もう一方のグループは使用しませんでした。データサイエンスプロジェクトを完了しました。予想どおり、AutoDS は生産性を向上させます。しかし驚くべきことに、AutoDS グループによって作成されたモデルの方が品質が高く、エラーが少なくなっていますが、人間の信頼スコアは低いことがわかりました。データサイエンスライフサイクルにおける人間の作業に自動化手法を組み込むための設計の意味を示すことにより、調査結果を反映します。by Google翻訳

June 4 (Fri)

Society for Imaging Informatics in Medicine (SIIM): 335 teams, 2 months to go

Original: B7 & 600 pixel & TPU --> LB=0.383

B0 & 224 pixel & GPU --> LB=0.373

B1 & 240 pixel & GPU --> LB=0.377

B2 & 260 pixel & GPU --> LB=0.382

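June 5 (Sat)

Society for Imaging Informatics in Medicine (SIIM): 348 teams, 2 months to go

B2 & 260 pixel & GPU --> LB=0.382

TPU became available again, so I ran the same configuration as on GPU, but the intermediate results were worse: where I expected 0.77+, I got 0.75+.

On TPU, it seems you need to cast to bfloat16 explicitly.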

image = tf.cast(image, tf.bfloat16)

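Apparently this gives accuracy comparable to float32 on GPU.

In addition, Kaggle's TPU runs 8 cores in parallel, so unless you check the batch size it ends up 8 times the GPU value. That lowers the effective learning rate, among other effects, and accuracy can fall below the GPU run.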
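The batch-size arithmetic can be sketched as follows. The linear learning-rate scaling rule used here is a common heuristic that I am assuming for illustration; it is not taken from the notebook, and the function names and the concrete numbers (16 per replica, 1e-3) are only examples.

```python
def global_batch_size(per_replica_batch, num_replicas):
    """Total number of samples consumed per optimizer step across all replicas."""
    return per_replica_batch * num_replicas

def scaled_learning_rate(base_lr, base_batch, new_batch):
    """Linear scaling heuristic: grow the learning rate in proportion to batch size."""
    return base_lr * new_batch / base_batch

gpu_batch = 16        # batch size used on a single GPU (example value)
tpu_replicas = 8      # Kaggle TPU cores running in parallel
tpu_batch = global_batch_size(gpu_batch, tpu_replicas)
tpu_lr = scaled_learning_rate(1e-3, gpu_batch, tpu_batch)
print(tpu_batch, tpu_lr)
```

If the learning rate is left at the GPU value while the global batch grows 8 times, each sample contributes less to every update, which is one plausible reading of the "effective learning rate drops" remark.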

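I explored the settings a little, but with the following result the TPU run did not reach the GPU score of LB=0.382.

B2 & 260 pixel & GPU: LB=0.382

B2 & 260 pixel & TPU: LB=0.375

Further investigation is needed. In this case, the TPU was about 4 times faster than the GPU.

On GPU, a smaller model/resolution combination gave prediction accuracy comparable to TPU. However, going to a larger model/resolution combination on GPU runs out of memory. I plan to check whether lowering the batch size resolves this.

Thanks to parallel computation, the TPU is faster than the GPU and so seems able to handle larger model/resolution combinations, but at the same model/resolution it seems to be less accurate than the GPU. One cause turned out to be that I had not been using bfloat16. With identical settings the accuracy should be the same, so I need to check whether I have overlooked something.

Looking into data_augmentation: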

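import tensorflow as tf

def build_augmenter(with_labels=True):
    # Random horizontal and vertical flips applied to a single image.
    def augment(img):
        img = tf.image.random_flip_left_right(img)
        img = tf.image.random_flip_up_down(img)
        return img

    # Same augmentation for (image, label) pairs; the label passes through unchanged.
    def augment_with_labels(img, label):
        return augment(img), label

    return augment_with_labels if with_labels else augment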

Random ops listed on TensorFlow's data augmentation tutorial page:

tf.image.stateless_random_brightness
tf.image.stateless_random_contrast
tf.image.stateless_random_crop
tf.image.stateless_random_flip_left_right
tf.image.stateless_random_flip_up_down
tf.image.stateless_random_hue
tf.image.stateless_random_jpeg_quality
tf.image.stateless_random_saturation

I tried adding brightness, but it raised an error and did not work.

tf.image.stateless_random_brightness(image, max_delta=0.5, seed=new_seed)

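June 6 (Sun)

Society for Imaging Informatics in Medicine (SIIM): 366 teams, 2 months to go

Using public code, I will run study-level (classification) training and prediction with the following combinations. Each presumably has its own optimal settings, but I kept everything except the model and resolution identical: batch_size=32, Adam(1e-3).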

B0 & 224 pixel & TPU：LB=0.367 : 1.000

B1 & 240 pixel & TPU：LB=0.374 : 1.019

B2 & 260 pixel & TPU：LB=0.375 : 1.022

B3 & 300 pixel & TPU：LB=0.378 : 1.030

B4 & 380 pixel & TPU：LB=0.381 : 1.038

B5 & 456 pixel & TPU：LB=0.375 : 1.022

B6 & 528 pixel & TPU：LB=0.379 : 1.033

B7 & 600 pixel & TPU：LB=0.369 : 1.005
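The second number on each line above is the LB score divided by the B0 score. A short sketch that reproduces that column from the LB values listed (the variable names are mine):

```python
# LB scores from the runs listed above (EfficientNet variant -> leaderboard score).
lb_scores = {
    "B0": 0.367, "B1": 0.374, "B2": 0.375, "B3": 0.378,
    "B4": 0.381, "B5": 0.375, "B6": 0.379, "B7": 0.369,
}

baseline = lb_scores["B0"]
# Ratio of each score to the B0 baseline, rounded to three decimals.
relative = {name: round(score / baseline, 3) for name, score in lb_scores.items()}
best_model = max(lb_scores, key=lb_scores.get)

for name, ratio in relative.items():
    print(f"{name}: LB={lb_scores[name]:.3f}  relative={ratio:.3f}")
print("best:", best_model)
```

The whole spread from B0 to the best model (B4) is under 4%, which helps put the B5 to B7 results in proportion.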

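From B0 & 224 up to B4 & 380, the results are in line with expectations.

The results from B5 & 456 to B7 & 600, however, would puzzle most people.

Figure 1 of the EfficientNet paper shows ImageNet Top-1 accuracy rising from B0 to B7. Looking at that figure, anyone would want to use the higher-numbered models. Let me look into why the paper and the experiment above disagree.

The paper describes how the results in Figure 1 were obtained as follows.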

5.2. ImageNet Results for EfficientNet
We train our EfficientNet models on ImageNet using similar settings as (Tan et al., 2019): RMSProp optimizer with decay 0.9 and momentum 0.9; batch norm momentum 0.99;
weight decay 1e-5; initial learning rate 0.256 that decays by 0.97 every 2.4 epochs. We also use swish activation (Ramachandran et al., 2018; Elfwing et al., 2018), fixed AutoAugment policy (Cubuk et al., 2019), and stochastic depth (Huang et al., 2016) with drop connect ratio 0.3. As commonly known that bigger models need more regularization, we linearly increase dropout (Srivastava et al., 2014) ratio from 0.2 for EfficientNet-B0 to 0.5 for EfficientNet-B7.

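The final part is especially important: the larger the model, the more regularization matters, and the dropout ratio is raised from 0.2 for B0 to 0.5 for B7.

Clearly, just looking at Figure 1 and reaching for a bigger model does not work.

So what should I do concretely?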

１．RMSProp optimizer with decay 0.9 and momentum 0.9; batch norm momentum 0.99; weight decay 1e-5; initial learning rate 0.256 that decays by 0.97 every 2.4 epochs.

２．swish activation

tf.keras.activations.swish

３．fixed AutoAugment policy

https://github.com/google/automl/blob/master/efficientnetv2/autoaugment.py

AutoAugment: Learning Augmentation Strategies from Data

RandAugment: Practical automated data augmentation with a reduced search space

４．stochastic depth (Huang et al., 2016) with drop connect ratio 0.3

Stochastic depth aims to shrink the depth of a network during training, while keeping it unchanged during testing. We can achieve this goal by randomly dropping entire ResBlocks during training and bypassing their transformations through skip connections.

５．linearly increase dropout (Srivastava et al., 2014) ratio from 0.2 for EfficientNet-B0 to 0.5 for EfficientNet-B7
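Two of the items above are plain arithmetic and can be sketched directly. This takes the paper's wording literally: a learning rate that starts at 0.256 and is multiplied by 0.97 every 2.4 epochs (I assume a staircase decay), and a dropout ratio interpolated linearly from 0.2 (B0) to 0.5 (B7). The function names are hypothetical, and actual implementations may use rounded per-model values.

```python
import math

def efficientnet_lr(epoch, initial_lr=0.256, decay_rate=0.97, decay_epochs=2.4):
    """Staircase exponential decay: multiply by decay_rate every decay_epochs epochs."""
    return initial_lr * decay_rate ** math.floor(epoch / decay_epochs)

def linear_dropout(model_index, first=0.2, last=0.5, num_models=8):
    """Dropout ratio interpolated linearly from B0 (index 0) to B7 (index 7)."""
    return first + (last - first) * model_index / (num_models - 1)

for epoch in (0, 3, 5, 10):
    print(f"epoch {epoch}: lr={efficientnet_lr(epoch):.5f}")
for i in range(8):
    print(f"B{i}: dropout={linear_dropout(i):.3f}")
```

Under this reading B4 would use a dropout of about 0.37, noticeably more than the 0.2 of B0.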

I now see that getting results worthy of a large model takes considerable preparation.

GitHub：google/automl/efficientnetv2

EfficientNetV2

May13/2021: Initial code release for EfficientNetV2 models: accepted to ICML'21.

1. About EfficientNetV2 Models

EfficientNetV2 are a family of image classification models, which achieve better parameter efficiency and faster training speed than prior arts.

Built upon EfficientNetV1, our EfficientNetV2 models use neural architecture search (NAS) to jointly optimize model size and training speed, and are scaled up in a way for faster training and inference speed.


June 7 (Mon)

Society for Imaging Informatics in Medicine (SIIM): 386 teams, 2 months to go

B5 & 456 pixel & TPU: LB=0.375 (batch_size=32)

This was trained with batch_size=32. Let me check the effect of batch size. TPU time is running low, so I will start with batch_size=128.

The result was LB=0.375. The LB score was the same for 32 and 128.

B7 & 600 pixel & TPU gave LB=0.369 with batch_size=32 and LB=0.383 with 128. I do not understand this. Did I make a mistake somewhere?

B4 & 380 pixel & TPU & batch_size=32: LB=0.381

I replaced Adam in this model with AdamW (weight_decay=1e-4).

As a result, LB=0.384.

I tried Lookahead combined with AdamW, using the default parameters or the values shown in the example of usage. The intermediate results were clearly poor, so I stopped the Lookahead(AdamW) training partway through.

I have completely forgotten about CyclicLR and data_augmentation.

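June 8 (Tue)

Society for Imaging Informatics in Medicine (SIIM): 402 teams, 2 months to go

I have used up my TPU quota, so back to GPU.

B2 & 260 pixel & GPU: LB=0.382 (batch_size=16), Adam, 1e-3

I will check the effect of random_brightness with this setup.

img = tf.image.random_brightness(img, 0.2)

I will also try replacing Adam with AdamW(lr=1e-3, weight_decay=1e-4).

Just before training finished, the computation stalled. Go to Viewer does not work either. What happened? In any case, about 3 hours were wasted.

The image level is still untouched. I have never worked on detection seriously, so I will look for example code and study it. The scoring problem for image-level predictions also seems not to have been fixed yet.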

June 9 (Wed)

Society for Imaging Informatics in Medicine (SIIM): 415 teams, 2 months to go

Waiting for the TPU quota to reset!

June 10 (Thu)

Society for Imaging Informatics in Medicine (SIIM): 439 teams, 2 months to go

***** Break *****

June 15 (Tue)

SIIM: 496 teams, 2 months to go (8/9)

It looks like the problems are finally starting to be fixed.

みなさん、こんにちは。今しばらくお待ちいただきますようお願いいたします。ホストチームはいくつかのテストラベルを更新しました。また、画像レベルのラベルのスコアリングに影響を与える問題も修正しました。フィードバックをお寄せいただきありがとうございます。リーダーボードを更新するプロセスを開始します。すべての提出物を再実行するため、これには時間がかかることに注意してください。 by Google翻訳

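The score had plateaued at 0.45+, but it has gradually risen and the current top is 0.62. My own submissions contain only study-level predictions; the image-level predictions are still the defaults.

So from today I will start on image-level prediction, although I am relying entirely on public code.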

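June 16 (Wed)

No progress

June 17 (Thu)

No progress

June 18 (Fri)

SIIM: 571 teams, 2 months to go

Rescoring of the image-level predictions is progressing, and about 160 submissions now exceed 0.500.

After rescoring, my study-level score had gone up by 0.055 (from 0.384 to 0.439).

That is what I thought at first, but it is not so. I had filled the image-level predictions with default values rather than leaving them blank (the public code did that). Now that image-level scores are computed correctly, the default image-level data contribute a score of 0.055, and that is the reason for the increase.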

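*** Out for a stroll ***

June 26 (Sat)

SIIM: 706 teams, a month to go

For this competition I want to get good results by going into model construction rather than just tuning. That is the intention, but it is hard. The labels are assigned by experts, yet they are not exact. If a model is more accurate than the labels, its score does not rise; it falls, because the ground-truth labels themselves are inexact. On the other hand, to claim that the model's prediction is the real truth, one would have to prove it. To exceed expert judgment, a model would need to amplify and expose information that experts overlook or cannot see, or to discover correlations between images that must be taken into account for a more correct diagnosis. That evidence would also have to be made visible. I feel such things could happen in diagnostic imaging, but at my current level it is 100% overfitting.

July 1 (Thu)

40 days left

August 1 (Sun)

While taking detours, I found I could not get back.

Regrettably, I am stopping here.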

f:id:AI_ML_DL:20210531004031p:plain — style=166, iteration=500

f:id:AI_ML_DL:20210531004202p:plain — style=166, iteration=50

f:id:AI_ML_DL:20210531004252p:plain — style=166, iteration=5

2021-05-10

Kaggle Stroll (May 11 to June 3)

May 11 to June 3

Bristol-Myers Squibb – Molecular Translation
Can you translate chemical images to text?

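Participation period: about 20 days. The task is to infer text in InChI (International Chemical Identifier) format from images of chemical structural formulas.

* I joined this competition on March 8 and then left it untouched.

May 11 (Tue)

Bristol-Myers Squibb: 647 teams, 23 days to go

There are relatively few teams. The task is probably too specialized. Then again, it is perhaps remarkable that this many teams have joined such a specialized task.

So, how should I approach this competition?

Purpose: omitted

Goal: finish in the top 5%. ⇒ Because I approached it the wrong way, the goal becomes understanding the paper "End-to-End Attention-based Image Captioning" and, after the competition ends, studying the top solutions as deeply as I can. ⇒ I drifted away from the competition partway through to work on Reinforcement Learning, and it all became rather unfocused. If there is time at the end, I will run the public train and inference code and learn at least a little from actual code.

Method: omitted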

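Score calculation: Levenshtein distance. The Levenshtein distance is a kind of distance that indicates how different two strings are. It is also called the edit distance. Specifically, it is defined as the minimum number of single-character insertions, deletions, or substitutions required to transform one string into the other [1]. (from Wikipedia)

If the correct answer is abcde and the prediction is bbccd, three characters (positions 1, 4, and 5) have to be substituted to match the correct answer, so the score is 3. A correct prediction scores 0, and a five-character prediction with every character wrong scores up to 5. When the lengths differ, insertions and deletions are counted as well.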
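The distance can be checked with a short dynamic-programming sketch. This is a generic textbook implementation written for illustration, not the competition's scoring code; for abcde versus bbccd it returns 3.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # prev[j] holds the distance between the first i-1 characters of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

print(levenshtein("abcde", "bbccd"))
print(levenshtein("abcde", "abcde"))
print(levenshtein("abc", "abcd"))
```

Because insertions and deletions are allowed, the distance can be smaller than the number of mismatched positions when one string is a shifted copy of the other.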

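Submitting sample_submission.csv as is gives a leaderboard score of 109.63. Currently 100th place is 3.70, 50th place is 2.09, Gold is within 1.13, and the top is 0.65.

Roughly speaking, the top team is at 99 or more points out of 100, and even 50th place is at about 98, so the level is extremely high.

Can the conversion really be this accurate? That is impressive. I feel it is impossible to break in here. With a half-hearted approach, I doubt I can get into the top 5%.

A "big upset" like the one in the earlier HuBMAP competition seems unlikely.

An example of a chemical structural formula and the corresponding InChI representation (from Wikipedia).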

f:id:AI_ML_DL:20210511153524p:plain

InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1
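An InChI string like the one above is a sequence of layers separated by "/": the version, the chemical formula, and then layers tagged by a one-letter prefix (c for connectivity, h for hydrogens, t/m/s for stereochemistry, and so on). The following sketch splits the string into those parts. The function `parse_inchi` is a hypothetical helper written for illustration; it ignores complications such as multi-component formulas and is no substitute for a cheminformatics library.

```python
import re

def parse_inchi(inchi):
    """Split an InChI string into version, formula atom counts, and prefixed layers."""
    prefix, _, rest = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string")
    parts = rest.split("/")
    version, formula = parts[0], parts[1]
    # Remaining layers start with a one-letter tag, e.g. "c..." or "h...".
    layers = {part[0]: part[1:] for part in parts[2:]}
    # Count atoms in a simple formula such as C6H8O6 (a missing number means 1).
    atoms = {el: int(n) if n else 1
             for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula)}
    return {"version": version, "formula": formula, "atoms": atoms, "layers": layers}

example = "InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1"
parsed = parse_inchi(example)
print(parsed["atoms"])
print(sorted(parsed["layers"]))
```

The numbers in the c layer refer to atoms in the order given by the formula, which is why a prediction has to get the formula and the connectivity right together.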

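The structure borrowed from Wikipedia is very clean, but many structures provided in the competition look like images reproduced repeatedly from old books, and some make it hard to distinguish element symbols such as N, O, S, and P.

A paper on exactly the same task as this competition has been published.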

End-to-End Attention-based Image Captioning

Carola Sundaramoorthy, Lin Ziwen Kelvin, Mahak Sarin, and Shubham Gupta

arXiv:2104.14721v1 [cs.CV] 30 Apr 2021

Abstract
In this paper, we address the problem of image captioning specifically for molecular
translation where the result would be a predicted chemical notation in InChI format
for a given molecular structure. Current approaches mainly follow rule-based or
CNN+RNN based methodology. However, they seem to underperform on noisy
images and images with small number of distinguishable features. To overcome this,
we propose an end-to-end transformer model. When compared to attention-based
techniques, our proposed model outperforms on molecular datasets.

概要
この論文では、特に分子翻訳のための画像キャプションの問題に取り組みます。この場合、結果は、特定の分子構造に対してInChI形式で予測される化学表記になります。現在のアプローチは、主にルールベースまたはCNN + RNNベースの方法論に従います。ただし、ノイズの多い画像や識別可能な特徴の数が少ない画像では、パフォーマンスが低下しているように見えます。これを克服するために、エンドツーエンドのtransformer モデルを提案します。 Attentionベースの手法と比較すると、提案されたモデルは分子データセットよりも優れています。by Google翻訳


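The best score in this paper is 6.95, equivalent to 231st place on the current leaderboard.

The competition is entering its final phase, so very high-level information and public code have accumulated on the competition site.

May 12 (Wed)

Bristol-Myers Squibb: 654 teams, 22 days to go

Discussion:

It seems there had been a problem in this competition.

An AI contest on converting molecular structure images to SMILES (Dacon molecular to smiles competition, 2020.09.01 ~ 2020.10.09) was held in Korea, information about the top-3 code was made public, and that information was shared within the Kaggle competition for some period.

SMILES and InChI differ in format, but the material contains information useful for building high-performance prediction models, and apparently some teams were using it to get high scores.

Someone who felt this was unfair posted a summary in the Discussion section, and others joined in so that anyone could access the top-3 information.

I confirmed that the top-3 code is stored in a dataset and publicly available. Probably only highly skilled people can make use of it. It seems too difficult for me, but I will give it a try.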

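May 13 (Thu)

Bristol-Myers Squibb: 673 teams, 21 days to go

A model that converts molecular structures to InChI codes:

The molecular structure is given as an image. The image quality is not that of a clean figure in a new textbook; it is blurred as if photocopied repeatedly. Element symbols such as O, N, P, and S are unclear, single and double bonds are sometimes hard to tell apart, and bonds indicating stereochemistry can also be unclear. Speckle noise is present.

Before estimating the InChI code from an unclear image, create a clean molecular structure model. Beyond sharpening the image, make every carbon atom visible, display at each carbon position the number corresponding to the InChI code, and color-code the atoms so they are easy to identify.

Convert InChI codes into clean molecular structure models. Using pairs of an original image and the clean structure generated from its InChI code, build a model that converts original images into clean structures.

Feeding the clean structure converted from a test image into a model that can convert clean structures to InChI codes should then yield an accurate InChI code.

Original image ⇒ Model A ⇒ molecular structure ⇒ Model B ⇒ InChI code.

Can this actually be done?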

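May 14 (Fri)

Bristol-Myers Squibb: 688 teams, 20 days to go

I plan to follow the explanations and cited references posted in Discussion by the GM with the Kerokerokeroppi avatar.

May 15 (Sat)

Bristol-Myers Squibb: 697 teams, 19 days to go

Is it possible to draw a structural formula from an InChI code?

Code that draws molecular structures from InChI codes using RDKit has been published.

On the top page of the RDKit documentation, InChI does not appear at all, while SMILES appears 11 times.

Searching from the search page shows "Search finished, found 34 page(s) matching the search query.", so 34 pages contain InChI.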

rdkit.Chem.inchi module

rdkit.Chem.inchi.MolFromInchi(inchi, sanitize=True, removeHs=True, logLevel=None, treatWarningAsError=False)
Construct a molecule from a InChI string

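This appears to be the function that constructs a molecule from an InChI code (a string).

To draw the constructed molecule, the following module is apparently used.

rdkit.Chem.Draw.rdMolDraw2D module

Reading the documentation does not immediately let you write code. It looks unusable unless you can at least handle C++ and Python code freely.

There is public example code, so I will proceed with the necessary work while understanding it.

Even if I can get molecular structure images from InChI codes that are cleaner than the dataset images, building a model that converts those structure images to InChI codes is not easy.

The following paper looks useful.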

Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI,

Noel M O’Boyle, Journal of Cheminformatics 2012, 4:22

f:id:AI_ML_DL:20210515235510p:plain

Figure 1 An overview of the steps involved in generating Universal and Inchified SMILES. The normalisation step just applies to Inchified SMILES. To simplify the diagram a Standard InChI is shown, but in practice a non-standard InChI (options FixedH and RecMet) is used for Universal SMILES.

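If I could generate images with atom position numbers displayed, as in this figure, I think I could build a model that converts to InChI codes more accurately.

This paper was published in 2012, so the conversion to SMILES in the Korean contest was quite possibly done with reference to the procedure shown here. I expect this scheme can be used in the Kaggle competition as well.

May 16 (Sun)

Bristol-Myers Squibb: 703 teams, 18 days to go

Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI, Noel M O’Boyle, Journal of Cheminformatics 2012, 4:22

Lessons from this paper:

The subject is standardizing the SMILES representation based on InChI.

InChI and SMILES both represent a molecular structure on a single line. They are suited to storing, representing, and entering structural information into a computer; they must identify the molecule unambiguously, be understandable to both computers and people, and be concise.

InChI, as the name International Chemical Identifier suggests, appears to have been promoted by IUPAC (with NIST as the driving body) with the aim of international standardization.

SMILES, as the name Simplified Molecular Input Line Entry System suggests, is a simplified line-entry system for molecules; it is more concise than InChI and easier to grasp intuitively.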

SMILES : C(=O)([O-])C(=O)O

Standard InChI : InChI=1/C2H2O4/c3-1(4)2(5)6/h(H,3,4)(H,5,6)/p-1

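Toward the end of page 1, the paper lists single-line notations other than SMILES and InChI: SLN, ROSDAL, WLN, MCDL, and so on. There is also InChIKey. According to Wikipedia: "InChIKey, also called hashed InChI, has a fixed length of 25 characters, but as a digital representation it is not human-readable. The InChIKey specification was released in September 2007 to enable web searches [5]. Unlike InChI itself, InChIKey is not unique; although very rare, collisions do occur [6]."

SMILES, used in the Korean contest, was as of 2012 the most popular line notation ("The SMILES format is the most popular line notation in use today."). Its problems, however, are difficulty in representing stereochemistry and the lack of standardization, and this paper proposes standardization based on InChI. The reasons given for the lack of progress in standardizing SMILES include proposed methods not handling stereochemistry, proprietary commercialization, and free software implementations being mutually incompatible and unpublished.

It seems that development of a new single-line notation for molecules began at NIST (National Institute of Standards and Technology) in 1999, and InChI was proposed as an international standard.

This paper takes the approach of generating a standard SMILES using the canonical labels from InChI. Code from various open-source cheminformatics libraries (Open Babel, Chemistry Development Kit, RDKit, Chemkit, Indigo, etc.) can apparently be used.

Page 3 defines two kinds: Inchified SMILES and Universal SMILES.

The paper's explanation uses the Open Babel chemoinformatics toolkit. The public Kaggle code uses RDKit.

Methods: do the following four steps involve rendering the figure, displaying atom position numbers, or changing those numbers? Apparently not.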

structure normalization

canonical labeling

graph traversal

SMILES generation

Fig. 1 draws the molecular structure only to explain the InChI-to-SMILES conversion process, and the atom position numbers are shown only as an aid to understanding; the actual conversion appears to be done instantly by Open Babel's obabel command-line program.

The following commands show how to use the obabel command-line program to generate Universal and Inchified SMILES strings for a structure stored in a Mol file:
C:\>obabel figure1.mol -osmi -xU
c1cc(/C=C/F)cc(c1)[N+](=O)[O-]
C:\>obabel figure1.mol -osmi -xI
c1cc(/C=C/F)cc(c1)N(=O)=O

Tomorrow I will learn (from the public code) how to render molecular structures from InChI, and build a dataset of the resulting structure images.

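May 17 (Mon)

Bristol-Myers Squibb: 715 teams, 17 days to go

Let me read a bit further into the paper.

I cannot make sense of it just by reading, so let me try OpenBabel.

I installed OpenBabel 2.4.0.

OpenBabelGUI opens. From left to right it shows INPUT FORMAT, CONVERT, and OUTPUT FORMAT. Let me run the equivalent of the following command in OpenBabelGUI.

C:\>obabel figure1.mol -osmi -xU
c1cc(/C=C/F)cc(c1)[N+](=O)[O-]

First, set the input format on the left to InChI and the output format on the right to SMILES.

Next, enter the InChI code shown in Fig. 1 of the paper into the left pane.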

InChI=1S/C8H6FNO2/c9-5-4-7-2-1-3-8(6-7)10(11)12/h1-6H/b5-4+

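Clicking the CONVERT button in the center produced the SMILES code.

c1cc(/C=C/F)cc(c1)N(=O)=O

Next, keeping the InChI code on the left, I set the output to png -- PNG 2D depiction and specified an output file name.

Clicking the CONVERT button in this state saved the following image to the specified file.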

f:id:AI_ML_DL:20210517112650p:plain

Encouraged, I pasted an InChI code from the competition dataset into the left pane, changed the output file name, and clicked CONVERT; the molecule was rendered as follows.

InChI=1S/C13H15NO2/c1-10(14-11(2)16)13-7-5-12(6-8-13)4-3-9-15/h5-8,10,15H,9H2,1-2H3,(H,14,16)

f:id:AI_ML_DL:20210517113405p:plain

The rendering parameters for the molecular structure can be set in detail.

Carbon atoms can also be shown explicitly.

InChI=1S/C38H40N4O6S2/c43-35-25-47-31-17-5-1-13-27(31)37(45)39-21-9-10-22-40-38(46)28-14-2-6-18-32(28)48-26-36(44)42-30-16-4-8-20-34(30)50-24-12-11-23-49-33-19-7-3-15-29(33)41-35/h1-8,13-20H,9-12,21-26H2,(H,39,45)(H,40,46)(H,41,43)(H,42,44)

f:id:AI_ML_DL:20210517121736p:plain

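So OpenBabelGUI can do the same thing as obtaining a molecular structure from an InChI code with RDKit. At present, however, I can process only one InChI code at a time.

I can convert InChI codes to clean molecular structures and show carbon atoms, but I have not managed to display atom positions.

For now, suppose the structure images in train_data have been converted to clean images using the InChI labels, and let me check whether my original idea is feasible.

How to clean up the test_data images:

This may take me far from the competition, but for improving image quality, GANs or Neural Style Transfer seem usable. I have run such code before but do not think I understand the essence, so I will read the Generative Deep Learning textbook.

Generative Deep Learning

by David Foster

The competition may be over before I finish this book! Let me look for a technique I can use. Some useful information may be hidden there. It need not be usable immediately; it is enough if the next step comes into view.

With just a little thought, I realized that this looks like deliberately using deep learning for something that can be answered 100% correctly. The task in this competition is, in essence, deterministic rather than probabilistic. It is like training a CNN + RNN/LSTM on pairs of equation images and numeric answers, without teaching the rules of arithmetic, and having it infer the correct label from an equation image. Something similar applies to the attempt to have an AI solve the University of Tokyo entrance exam, which fell short of success. Because the problems are given as images, even a deterministic problem turns into a probabilistic one, or the intent of the problem can be grasped only with some probability; this may be why the score could not get past a certain level. Considering how to give the system the knowledge needed to reach an examinee's level, one would need to prepare, for each problem or each element of a problem, a dataset sufficient to develop the ability to solve it, and perhaps the effort to prepare such an enormous dataset was lacking.

Not only for the University of Tokyo exam, or entrance exams in general, enabling neural networks to solve all kinds of problems may be very important. It might lead to self-learning. Becoming able to solve problems means acquiring pseudo-knowledge from image information (exam problems), progressing from arithmetic to mathematics, from natural sciences such as physics, chemistry, and biology to logic, economics, business, and law, and possibly even to philosophical thinking. How do humans acquire knowledge? How do humans acquire the ability to think? What is thinking made of? Is conversation anything more than the output of stored memories?

I want to think harder about the learning methods of machine learning, but I feel I do not sufficiently understand the foundation to think from. What functions are needed to evolve a learning machine into a thinking machine, and further into a creative machine and a research machine? And what functions would be needed for these to self-replicate?

Enough daydreaming (delusion); back to the textbook.

That is all for today.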

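May 18 (Tue)

Bristol-Myers Squibb: 727 teams, 16 days to go

Continuing Generative Deep Learning by David Foster:

Generative and Discriminative: my impression is that generative means making something similar, and discriminative means telling things apart. Being similar means being hard to tell apart, so it may seem only a difference of viewpoint, but the biggest difference between the generative and discriminative modeling processes is the output. A generative model outputs images, sound, or text. A discriminative model outputs labels or scores. The output in this competition is a label.

Generative modeling: it is important to uncover how the data are produced. It matters for exploring means and paths to a goal, as in the guiding principle of reinforcement learning. The ultimate form of generative modeling is to understand and reproduce what is happening in the brain of the person reading this book right now.

reinforcement learning: a model that became famous for programs beating people at Atari games and then went on to beat people at chess, shogi, and Go. It has since apparently been applied to autonomous driving, drug discovery, protein structure analysis, and more. I am trying to learn it because it is an important technology, but I do not really understand its essence.

First, let me skim Chapter 8, Play, of the textbook.

After the explanation of Reinforcement Learning comes a worked example, and the example is a game, CarRacing. That kills my motivation at once. What to do?

Last September, among the many competitions I joined, there was Halite by Two Sigma, Collect the most halite during your match in space. It looked like a competition for trying Reinforcement Learning, so I thought I would learn RL there. But I was in several other competitions at the same time, had far too little time to devote to each, and ended up with both my RL study and my score half-finished. Recalling this, I went to the site and skimmed the write-up by the top team (an individual), and was surprised. That person had apparently been at DeepMind, which is known for its deep reinforcement learning work, and knew the subject thoroughly. Naturally, their programming skill must be beyond measure too. Judging RL insufficient for reaching the top, they competed with conventional programming: apparently 11,000 lines. The point was not that Reinforcement Learning itself is no good, but that there was not enough training time. What struck me as impossible to imitate was that they observed the matches, read the opponents' tactics, devised tactics to outdo them, and programmed those.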

DEEP REINFORCEMENT LEARNING
Yuxi Li (yuxili@gmail.com), arXiv:1810.06339v1 [cs.LG] 15 Oct 2018

ABSTRACT
We discuss deep reinforcement learning in an overview style. We draw a big picture, filled with details. We discuss six core elements, six important mechanisms, and twelve applications, focusing on contemporary work, and in historical contexts. We start with background of artificial intelligence, machine learning, deep learning, and reinforcement learning (RL), with resources. Next we discuss RL core elements, including value function, policy, reward, model, exploration vs. exploitation, and representation. Then we discuss important mechanisms for RL, including attention and memory, unsupervised learning, hierarchical RL, multiagent RL, relational RL, and learning to learn. After that, we discuss RL applications, including games, robotics, natural language processing (NLP), computer
vision, finance, business management, healthcare, education, energy, transportation, computer systems, and, science, engineering, and art. Finally we summarize briefly, discuss challenges and opportunities, and close with an epilogue.

概要スタイルで深層強化学習について説明します。細部にまでこだわった全体像を描きます。現代の仕事に焦点を当て、歴史的な文脈で、6つのコア要素、6つの重要なメカニズム、および12のアプリケーションについて説明します。まず、人工知能、機械学習、深層学習、強化学習（RL）の背景とリソースを使用します。次に、価値関数、ポリシー、報酬、モデル、探索と活用、表現など、RLのコア要素について説明します。次に、注意と記憶、教師なし学習、階層型RL、マルチエージェントRL、リレーショナルRL、学習学習など、RLの重要なメカニズムについて説明します。その後、ゲーム、ロボット工学、自然言語処理（NLP）、コンピュータービジョン、財務、経営管理、ヘルスケア、教育、エネルギー、輸送、コンピューターシステム、科学、工学、芸術などのRLアプリケーションについて説明します。最後に、簡単に要約し、課題と機会について話し合い、エピローグで締めくくります。by Google翻訳

150 pages in total, full of equations; I cannot follow it at all.

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
Sergey Levine, Aviral Kumar, George Tucker, Justin Fu arXiv:2005.01643v3 [cs.LG] 1 Nov 2020

Abstract
In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection. Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision making engines. Effective offline reinforcement learning methods would be able to extract policies with the maximum possible utility out of the available data, thereby allowing automation of a wide range of decision-making domains, from healthcare and education to robotics. However, the limitations of current algorithms make this difficult. We will aim to provide the reader with an understanding of these challenges, particularly in the context of modern deep reinforcement learning methods, and describe some potential solutions that have been explored in recent work to mitigate these challenges, along with recent applications, and a discussion of perspectives on open problems in the field.

このチュートリアル記事では、オフライン強化学習アルゴリズムの研究を開始するために必要な概念ツールを読者に提供することを目的としています。これは、追加のオンラインデータ収集なしで、以前に収集されたデータを利用する強化学習アルゴリズムです。オフライン強化学習アルゴリズムは、大規模なデータセットを強力な意思決定エンジンに変えることを可能にするという大きな可能性を秘めています。効果的なオフライン強化学習手法は、利用可能なデータから最大限の有用性を備えたポリシーを抽出できるため、医療や教育からロボット工学まで、幅広い意思決定ドメインの自動化が可能になります。ただし、現在のアルゴリズムの制限により、これは困難です。特に現代の深層強化学習方法の文脈で、読者にこれらの課題の理解を提供することを目指し、これらの課題を軽減するために最近の研究で探求されたいくつかの潜在的な解決策、最近のアプリケーション、および議論について説明しますフィールドの未解決の問題に関する視点の。by Google翻訳

34 pages in total, full of equations; I cannot follow it at all.

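The next paper is the foundation for the main storyline of Chapter 8 on Reinforcement Learning in the textbook introduced earlier, Generative Deep Learning by David Foster; it is very carefully written, and I consider it essential reading.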

World Models
David Ha and Jurgen Schmidhuber, arXiv:1803.10122v4 [cs.LG] 9 May 2018

Abstract
We explore building generative neural network models of popular reinforcement learning
environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer
this policy back into the actual environment.

人気のある強化学習環境の生成ニューラルネットワークモデルの構築を検討します。私たちの世界モデルは、教師なしの方法ですばやくトレーニングして、環境の圧縮された空間的および時間的表現を学習できます。ワールドモデルから抽出された特徴をエージェントへの入力として使用することで、必要なタスクを解決できる非常にコンパクトでシンプルなポリシーをトレーニングできます。エージェントを、その世界モデルによって生成された独自の幻覚の夢の中で完全にトレーニングし、このポリシーを実際の環境に戻すこともできます。by Google翻訳

Humans develop a mental model of the world based on what they are able to perceive with their limited senses. The decisions and actions we make are based on this internal
model. Jay Wright Forrester, the father of system dynamics, described a mental model as:
The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system. (Forrester, 1971)

人間は、限られた感覚で知覚できるものに基づいて、世界のメンタルモデルを開発します。私たちが行う決定と行動は、この内部モデルに基づいています。システムダイナミクスの父であるジェイライトフォレスターは、メンタルモデルを次のように説明しました。私たちが頭に抱えている私たちの周りの世界のイメージは、単なるモデルです。彼の頭の中の誰も、すべての世界、政府、または国を想像していません。彼は概念とそれらの間の関係のみを選択し、それらを使用して実際のシステムを表現しています。（フォレスター、1971年）by Google翻訳

To handle the vast amount of information that flows through our daily lives, our brain learns an abstract representation of both spatial and temporal aspects of this information. We are able to observe a scene and remember an abstract description thereof. Evidence also suggests that what we perceive at any given moment is governed by our brain’s prediction of the future based on our internal model.

私たちの日常生活を流れる膨大な量の情報を処理するために、私たちの脳はこの情報の空間的側面と時間的側面の両方の抽象的な表現を学習します。シーンを観察し、その抽象的な説明を思い出すことができます。証拠はまた、私たちがいつでも知覚するものは、私たちの内部モデルに基づく私たちの脳の将来の予測によって支配されていることを示唆しています。by Google翻訳

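May 19 (Wed)

Bristol-Myers Squibb: 746 teams, 15 days to go

I mean to be recording how I work on this competition day by day, but I have wandered onto a side road, and today I continue along it.

Reinforcement Learning:

I have tried to learn it several times without success. Yesterday too I looked through several papers, browsed the Kaggle course, and leafed through the textbooks by F. Chollet, A. Geron, and D. Foster, but nothing sank in. What I have come to suspect is that Reinforcement Learning has old origins and that, in a sense, the basics are assumed to be known, so foundational explanations are omitted.

So I searched Google Scholar for Reinforcement Learning and decided to read the following paper. It is 25 years old.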

Reinforcement Learning: A Survey

Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore,

Journal of Artificial Intelligence Research 4 (1996) 237-285

Abstract
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.

この論文は、コンピュータサイエンスの観点から強化学習の分野を調査します。機械学習に精通した研究者がアクセスできるように書かれています。この分野の歴史的根拠と現在の研究の幅広い選択の両方が要約されています。強化学習は、動的環境との試行錯誤の相互作用を通じて行動を学習するエージェントが直面する問題です。ここで説明する作業は心理学での作業に似ていますが、詳細と使用法がかなり異なります。この論文では、探索と活用のトレードオフ、マルコフ決定理論によるフィールドの基盤の確立、遅延強化からの学習、学習を加速するための経験的モデルの構築、一般化と階層の利用、そして隠された状態への対処など、強化学習の中心的な問題について説明しています。それはいくつかの実装されたシステムの調査と強化学習のための現在の方法の実用性の評価で終わります。by Google翻訳+一部修正

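Having read about a third of the 30-page main text, my impression is that, perhaps because no killer application had been found, the discussion is theoretical and conceptual, lacks concreteness, and seems to go around in circles. Learning from the past is good, but this is also a world that is hard to get into.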
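One of the survey's central issues, the trade-off between exploration and exploitation, can be made concrete with a small example. The sketch below is a generic epsilon-greedy agent on a three-armed bandit. It is not taken from the paper; the win probabilities, step count, and function name are assumptions chosen for illustration.

```python
import random

def epsilon_greedy_bandit(arm_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli multi-armed bandit with an epsilon-greedy policy."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)    # how often each arm was pulled
    values = [0.0] * len(arm_probs)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(len(arm_probs))
        else:
            # Exploit: pick the arm with the highest estimated value.
            arm = max(range(len(arm_probs)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the mean reward for the chosen arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

counts, values = epsilon_greedy_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(values)), key=lambda a: values[a])
print(counts, [round(v, 2) for v in values], best_arm)
```

With epsilon = 0 the agent can lock onto the first arm that ever pays out. The 10% of random pulls is what lets it discover that another arm is better, which is the trade-off in miniature.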

Let me look at how Reinforcement Learning is used in autonomous driving. That is far more motivating than Reinforcement Learning combined with games.

Deep Reinforcement Learning for Autonomous Driving: A Survey
B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Yogamani, and Patrick Pérez, arXiv:2002.00444v2 [cs.LG] 23 Jan 2021

Abstract—With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of
simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
Index Terms—Deep reinforcement learning, Autonomous driving, Imitation learning, Inverse reinforcement learning, Controller learning, Trajectory optimisation, Motion planning, Safe reinforcement learning.

要約—深層表現学習の開発により、強化学習（RL）のドメインは、高次元環境で複雑なポリシーを学習できる強力な学習フレームワークになりました。このレビューでは、深層強化学習（DRL）アルゴリズムを要約し、自動運転エージェントの実際の展開における主要な計算上の課題に対処しながら、（D）RLメソッドが採用されている自動運転タスクの分類法を提供します。また、関連しているが古典的なRLアルゴリズムではない、動作の複製、模倣学習、逆強化学習などの隣接するドメインについても説明します。トレーニングエージェントにおけるシミュレータの役割、RLの既存のソリューションを検証、テスト、および堅牢化する方法について説明します。
インデックス用語-深層強化学習、自律運転、模倣学習、逆強化学習、コントローラー学習、軌道最適化、動作計画、安全強化学習。by Google翻訳

The main contributions of this work can be summarized as follows:
・Self-contained overview of RL background for the automotive community as it is not well known.
・Detailed literature review of using RL for different autonomous driving tasks.
・Discussion of the key challenges and opportunities for RL applied to real world autonomous driving.
The rest of the paper is organized as follows.

Section II provides an overview of components of a typical autonomous driving system.

Section III provides an introduction to reinforcement learning and briefly discusses key concepts.

Section IV discusses more sophisticated extensions on top of the basic RL framework.

Section V provides an overview of RL applications for autonomous driving problems.

Section VI discusses challenges in deploying RL for real-world autonomous driving systems.

Section VII concludes this paper with some final remarks.

f:id:AI_ML_DL:20210519144158p:plain

f:id:AI_ML_DL:20210519220609p:plain

The caption of this table is: OPEN-SOURCE FRAMEWORKS AND PACKAGES FOR STATE OF THE ART RL/DRL ALGORITHMS AND EVALUATION.

Let me look at the paper on DeepMind's "bsuite", the last entry in this table.

Behaviour Suite for Reinforcement Learning
Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, and Hado Van Hasselt, arXiv:1908.03568v3 [cs.LG] 14 Feb 2020
Abstract
This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives. First, to collect clear, informative and scalable problems that capture key issues in the design of general and efficient learning algorithms. Second, to study agent behaviour through their performance on these shared benchmarks. To complement this effort, we open source github.com/deepmind/bsuite, which automates evaluation and analysis of any agent on bsuite. This library facilitates reproducible and accessible research on the core issues in RL, and ultimately the design of superior learning algorithms. Our code is Python, and easy to use within existing projects. We include examples with OpenAI Baselines, Dopamine as well as new reference implementations. Going forward, we hope to incorporate more excellent experiments from the research community, and commit to a periodic review of bsuite from a committee of prominent researchers.
このホワイトペーパーでは、強化学習のためのBehavior Suite、または略してbsuiteを紹介します。 bsuiteは、2つの目的を持つ強化学習（RL）エージェントのコア機能を調査する慎重に設計された実験のコレクションです。まず、一般的で効率的な学習アルゴリズムの設計における重要な問題を捉えた、明確で有益でスケーラブルな問題を収集します。次に、これらの共有ベンチマークでのパフォーマンスを通じてエージェントの動作を調査します。この取り組みを補完するために、私たちはオープンソース
 github.com/deepmind/bsuite。bsuite上のエージェントの評価と分析を自動化します。このライブラリは、RLの主要な問題に関する再現性のあるアクセス可能な研究を促進し、最終的には優れた学習アルゴリズムの設計を促進します。私たちのコードはPythonであり、既存のプロジェクト内で簡単に使用できます。 OpenAIベースライン、ドーパミン、および新しいリファレンス実装の例が含まれています。今後は、研究コミュニティからのより優れた実験を取り入れ、著名な研究者の委員会による定期的なbsuiteのレビューに取り組んでいきたいと考えています。by Google翻訳

Interest in artificial intelligence has undergone a resurgence in recent years. Part of this interest is driven by the constant stream of innovation and success on high profile challenges previously deemed impossible for computer systems. Improvements in image recognition are a clear example of these accomplishments, progressing from individual digit recognition (LeCun et al., 1998), to mastering ImageNet in only a few years (Deng et al., 2009; Krizhevsky et al., 2012). The advances in RL systems have been similarly impressive: from checkers (Samuel, 1959), to Backgammon (Tesauro, 1995), to Atari games (Mnih et al., 2015a), to competing with professional players at DOTA (Pachocki et al., 2019) or StarCraft (Vinyals et al., 2019) and beating world champions at Go (Silver et al., 2016). Outside of playing games, decision systems are increasingly guided by AI systems (Evans & Gao, 2016).

近年、人工知能への関心が復活しています。この関心の一部は、以前はコンピュータシステムでは不可能と考えられていた注目を集める課題に対する革新と成功の絶え間ない流れによって推進されています。画像認識の改善は、これらの成果の明確な例であり、個々の数字の認識（LeCun et al, 1998）からわずか数年でImageNetを習得する（Deng et al, 2009; Krizhevsky et al, 2012）まで進んでいます。 RLシステムの進歩も同様に印象的でした。チェッカー（Samuel, 1959）、バックギャモン（Tesauro, 1995）、アタリゲーム（Mnih et al, 2015a）、DOTAでのプロプレーヤーとの競争（Pachocki et al, 2019）またはStarCraft（Vinyals et al, 2019）およびGo（Silver et al, 2016）で世界チャンピオンを破っています。ゲームをプレイする以外に、意思決定システムはますますAIシステムによって導かれています（Evans＆Gao, 2016年）。by Google翻訳

As we look towards the next great challenges for RL and AI, we need to understand our systems better (Henderson et al., 2017). This includes the scalability of our RL algorithms, the environments where we expect them to perform well, and the key issues outstanding in the design of a general intelligence system. We have the existence proof that a single self-learning RL agent can master the game of Go purely from self-play (Silver et al., 2018). We do not have a clear picture of whether such a learning algorithm will perform well at driving a car, or managing a power plant. If we want to take the next leaps forward, we need to continue to enhance our understanding.

RLとAIの次の大きな課題に目を向けるとき、システムをよりよく理解する必要があります（Henderson et al, 2017）。これには、RLアルゴリズムのスケーラビリティ、それらが適切に機能すると予想される環境、および未解決の主要な問題が含まれます。
一般的なインテリジェンスシステムの設計において。単一の自己学習RLエージェントが純粋に自己プレイから囲碁のゲームを習得できるという存在証明があります（Silver et al, 2018）。そのような学習アルゴリズムが車の運転や発電所の管理でうまく機能するかどうかについては、明確な見通しがありません。次の飛躍を遂げたいのであれば、理解を深めていく必要があります。 by Google翻訳

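This paper is by researchers at DeepMind, the company that revived Reinforcement Learning. DeepMind's founder was apparently a master game player; the company seems to have rediscovered Reinforcement Learning, conquered computer games and board games, developed Reinforcement Learning further, and widened its range of applications.

RL has surpassed humans at Go, shogi, and chess. For driving a car or managing a power plant, the paper itself says there is not yet a clear picture of whether such a learning algorithm will perform well. Is it just that the latter need real-time information from sensors and instruments that the former do not, with the basics being the same?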

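May 20 (Thu)

Bristol-Myers Squibb: 749 teams, 14 days to go

Converting blurry images of molecular structures into InChI codes with over 99% accuracy is so impressive that it looks as if the conversion rules are simply being applied correctly.

There is no room for me here. I will look forward to the solutions.

Reinforcement Learning:

Continuing to learn from "Behaviour Suite for Reinforcement Learning" by DeepMind.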

1.1 Practical theory often lags practical algorithms：

The current theory of deep RL is still in its infancy. In the absence of a comprehensive theory, the community needs principled benchmarks that help to develop an understanding of the strengths and weaknesses of our algorithms.

ディープRLの現在の理論はまだ揺籃期にあります。包括的な理論がない場合、コミュニティは、アルゴリズムの長所と短所の理解を深めるのに役立つ原則的なベンチマークを必要としています。by Google翻訳

1.2 An ‘MNIST’ for reinforcement learning

Just as the MNIST dataset offers a clean, sanitised, test of image recognition as a stepping stone to advanced computer vision; so too bsuite aims to instantiate targeted experiments for the development of key RL capabilities.

1.3 Open source code, reproducible research

As part of this project we open source github.com/deepmind/bsuite, which instantiates all experiments in code and automates the evaluation and analysis of any RL agent on bsuite. This library serves to facilitate reproducible and accessible research on the core issues in reinforcement learning.

1.4 Related work

2 Experiments

2.1 Example experiment: memory length

f:id:AI_ML_DL:20210520162152p:plain

f:id:AI_ML_DL:20210520162248p:plain

I tried to read and understand this, but I have no idea what the task is or what the following terms mean (DQN is the only one I have even seen before).
actor-critic with a recurrent neural network

feed-forward DQN

Bootstrapped DQN

A2C

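Chapter 18, Reinforcement Learning, of A. Géron's second-edition textbook explains these terms, so I will go through that chapter next. 57 pages in total.

It starts with games, which is why I gave up every time I tried it, but I can no longer back out. Without clearing this, I will not survive in the world of artificial intelligence.

The field is said to have begun around 1950. About when I was born. Oops, I just gave away my age.

It drew attention in 2013 with DeepMind's work on Atari: without being given any information about the rules of the games, and using only the screen, the system reached scores that beat humans.

DeepMind was acquired by Google for $500 million.

The key technical elements of what is called Deep Reinforcement Learning are policy gradients, deep Q-networks, and Markov decision processes.

These are applied to balancing a pole on a moving cart.
The TensorFlow-Agents library is introduced.

Using this library, an agent is trained to play Breakout, the famous Atari game.

So it is games after all, I cannot help thinking, but I will accept that there is no way forward without clearing this and get on with it. Before the Bristol-Myers Squibb competition ends.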

Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by A. Géron, Second Edition September 2019

Chapter 18 Reinforcement Learning

Reinforcement Learning (RL) is one of the most exciting fields of Machine Learning today, and also one of the oldest.

DeepMind applied the power of Deep Learning to the field of Reinforcement Learning, and it worked beyond their wildest dreams.

In this chapter we will first explain what Reinforcement Learning is and what it's good at, then present two of the most important techniques in Deep Reinforcement Learning: policy gradients and deep Q-networks (DQNs), including a discussion of Markov decision processes (MDPs).

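May 21 (Fri)

Bristol-Myers Squibb: 756 teams, 13 days to go

Dormant: even a score of 1.0 only gets 14th place. Amazing.

I looked at code that cleans up the molecular-structure images: removing dot-like noise, filling in missing dots along straight lines, cropping away unneeded margins, and so on. Beyond that, how could tasks such as adding carbon atoms or adding carbon position information be done with reinforcement learning?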

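Learning to Optimize Rewards

An agent, within an environment, makes observations, takes actions, and receives rewards in return.

Task examples

Robot: the agent is the program controlling the robot; the environment is the space the robot operates in; observations come from camera images and touch-sensor signals; actions are the signals sent to move or actuate the robot; rewards reflect how well it does, such as reaching the destination, performing the required motion, or the time taken.

Pac-Man: the agent is the program controlling Pac-Man; the environment is a simulation of the Atari game. (At first I did not see what this meant, then I got it: to train an agent that wins the game, the simulation is made to behave exactly like the real game. For any task, training requires a virtual reproduction of the real environment. Does this play a role similar to the data-label pairs in supervised learning? For an autonomous-driving agent, it would mean actually driving on roads to collect images, radar signals, and so on.) Actions are joystick positions, observations are screenshots, and rewards are the game points.

Go: similar to Pac-Man; the reward is the territory occupied.

Thermostat: I do not see what introducing reinforcement learning is supposed to achieve here. The closer the target's temperature is to the set point, the higher the reward. The temperature is detected by a sensor. Are a heater and a cooler operated alternately, or is only heating or only cooling used to approach the set point? The keyword turned out to be smart thermostat.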

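smart thermostat: quoting the part of Wikipedia that seems relevant:

Programmable schedules and automatic schedules
The programmable schedule feature of smart thermostats is similar to that of standard programmable thermostats. Users are given the option to program a custom schedule to reduce energy use while they are away from home. However, studies indicate that creating schedules manually can lead to more energy use than simply keeping the thermostat at a set temperature. [8] To avoid this problem, smart thermostats also provide an automatic schedule feature. This feature uses algorithms and pattern recognition to create a schedule that delivers both occupant comfort and energy savings. Once a schedule has been created, the thermostat keeps monitoring occupant behavior and adjusts the automatic schedule. By removing human error from scheduling, smart thermostats can create smart schedules that actually save energy. [13]

So the agent is the program that controls the thermostat's operating conditions?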
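To make the agent / environment / observation / action / reward vocabulary concrete for the thermostat case, here is a minimal sketch. Everything in it is an assumption for illustration: the `ToyThermostatEnv` class, its heat and leak constants, and the hard-coded `bang_bang_policy` are hypothetical and are not taken from the textbook or from any real thermostat. The only idea carried over from the notes above is that the reward gets higher as the temperature gets closer to the set point.

```python
import random


class ToyThermostatEnv:
    """Hypothetical room model: the room leaks heat toward the outside
    temperature, and the heater adds a fixed amount of heat per step."""

    def __init__(self, target=22.0, outside=10.0, seed=0):
        self.target = target
        self.outside = outside
        self.rng = random.Random(seed)
        self.temp = outside

    def reset(self):
        self.temp = self.outside + self.rng.uniform(0.0, 5.0)
        return self.temp  # observation: the sensor reading

    def step(self, action):
        # action: 0 = heater off, 1 = heater on
        heat = 1.5 if action == 1 else 0.0
        leak = 0.1 * (self.temp - self.outside)
        self.temp += heat - leak
        reward = -abs(self.temp - self.target)  # closer to target = higher reward
        return self.temp, reward


def bang_bang_policy(temp, target=22.0):
    """Hard-coded policy: heat whenever the room is below the target."""
    return 1 if temp < target else 0


env = ToyThermostatEnv()
obs = env.reset()
total_reward = 0.0
for _ in range(100):
    action = bang_bang_policy(obs)
    obs, reward = env.step(action)
    total_reward += reward
print(round(obs, 2), round(total_reward, 2))
```

A learned policy would replace `bang_bang_policy`; the loop itself (observe, act, collect reward) stays the same for the robot, Pac-Man, and Go examples.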

Other examples include stock trading, recommender systems, placing ads on a web page, and controlling where an image classification system should focus its attention.

Which tasks is Reinforcement Learning well suited to? It looks like I need to survey the application areas.

Reinforcement Learning, suited tasks, applied areas, ...: let me search Google Scholar with some suitable keywords.

Reinforcement learning for personalization: A systematic literature review

Floris den Hengst et al., Data Science 3 (2020) 107–147

Abstract.

The major application areas of reinforcement learning (RL) have traditionally been game playing and continuous control. In recent years, however, RL has been increasingly applied in systems that interact with humans. RL can personalize digital systems to make them more relevant to individual users. Challenges in personalization settings may be different from challenges found in traditional application areas of RL. An overview of work that uses RL for personalization, however, is lacking. In this work, we introduce a framework of personalization settings and use it in a systematic literature review. Besides setting, we review solutions and evaluation strategies. Results show that RL has been increasingly applied to personalization problems and realistic evaluations have become more prevalent. RL has become sufficiently robust to apply in contexts that involve humans and the field as a whole is growing. However, it seems not to be maturing: the ratios of studies that include a comparison or a realistic evaluation are not showing upward trends and the vast majority of algorithms are used only once. This review can be used to find related work across domains, provides insights into the state of the field and identifies opportunities for future work.
強化学習（RL）の主な応用分野は、伝統的にゲームプレイと継続的な制御でした。しかし、近年、RLは人間と相互作用するシステムにますます適用されています。 RLは、デジタルシステムをパーソナライズして、個々のユーザーとの関連性を高めることができます。パーソナライズ設定の課題は、RLの従来のアプリケーション分野で見られる課題とは異なる場合があります。ただし、パーソナライズにRLを使用する作業の概要は不足しています。この作業では、パーソナライズ設定のフレームワークを紹介し、系統的文献レビューで使用します。設定に加えて、ソリューションと評価戦略を確認します。結果は、RLがパーソナライズの問題にますます適用され、現実的な評価がより一般的になっていることを示しています。 RLは、人間が関与するコンテキストに適用するのに十分な堅牢性を備えており、フィールド全体が成長しています。ただし、成熟していないようです。比較または現実的な評価を含む研究の比率は上昇傾向を示しておらず、アルゴリズムの大部分は1回しか使用されていません。このレビューは、ドメイン間で関連する作業を見つけるために使用でき、フィールドの状態への洞察を提供し、将来の作業の機会を特定します。by Google翻訳

The Societal Implications of Deep Reinforcement Learning

Jess Whittlestone et al., Journal of Artificial Intelligence Research 70 (2021) 1003–1030

Abstract
Deep Reinforcement Learning (DRL) is an avenue of research in Artificial Intelligence (AI) that has received increasing attention within the research community in recent years, and is beginning to show potential for real-world application. DRL is one of the most promising routes towards developing more autonomous AI systems that interact with and take actions in complex real-world environments, and can more flexibly solve a range of problems for which we may not be able to precisely specify a correct ‘answer’. This could have substantial implications for people’s lives: for example by speeding up automation in various sectors, changing the nature and potential harms of online influence, or introducing new safety risks in physical infrastructure. In this paper, we review recent progress in DRL, discuss how this may introduce novel and pressing issues for society, ethics, and governance, and highlight important avenues for future research to better understand DRL’s societal implications.

Deep Reinforcement Learning（DRL）は、近年研究コミュニティ内でますます注目を集めている人工知能（AI）の研究手段であり、実際のアプリケーションの可能性を示し始めています。 DRLは、複雑な実世界の環境と相互作用してアクションを実行する、より自律的なAIシステムを開発するための最も有望なルートのひとつであり、正しい答えを正確に特定できない可能性のあるさまざまな問題をより柔軟に解決できます。これは、人々の生活に大きな影響を与える可能性があります。たとえば、さまざまなセクターでの自動化の高速化、オンラインの影響の性質と潜在的な害の変化、物理インフラストラクチャへの新しい安全上のリスクの導入などです。このホワイトペーパーでは、DRLの最近の進歩を確認し、これが社会、倫理、ガバナンスに斬新で差し迫った問題をどのようにもたらすかについて説明し、DRLの社会的影響をよりよく理解するための将来の研究のための重要な手段を強調します。by Google翻訳

May 22 (Sat)

Bristol-Myers Squibb: 764 teams, 12 days to go

The top-50 scores are being updated daily. Impressive. I would like to join them.

Deep Reinforcement Learning:

Reinforcement learning for personalization: A systematic literature review

I still do not understand what this paper is dealing with.

Wikipedia : Personalization (broadly known as customization) consists of tailoring a service or a product to accommodate specific individuals, sometimes tied to groups or segments of individuals. A wide variety of organizations use personalization to improve customer satisfaction, digital sales conversion, marketing results, branding, and improved website metrics as well as for advertising. Personalization is a key element in social media and recommender systems.

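systematic literature review (SLR): I had assumed this paper reviews papers on reinforcement learning for personalization, but could it instead mean that RL was applied to the SLR itself while surveying literature that applies RL to personalization? The abstract says the authors "introduce a framework of personalization settings and use it in a systematic literature review", which suggests that SLR is simply the survey method and RL for personalization is the subject being surveyed.

Let me look at the other paper.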

The Societal Implications of Deep Reinforcement Learning

1. Introduction

Our discussion aims to provide important context and a clear starting point for the AI ethics and governance community to begin considering the societal implications of DRL in more depth.

私たちの議論は、AI倫理およびガバナンスコミュニティがDRLの社会的影響をより深く検討し始めるための重要なコンテキストと明確な出発点を提供することを目的としています。by Google翻訳

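This paper also seems to miss the point for my purposes.

2. Deep Reinforcement Learning: a Brief Overview

It is indeed a brief overview. Nothing in it resonated with me. Let's say the fault is on the receiving end.

I will finish by pasting the Japanese translation of the last paragraph before the Acknowledgements.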

DRLは、AIの将来において重要な役割を果たす可能性が高く、ますます自律的で柔軟なシステムを約束し、ますますハイステークスドメインでのアプリケーションの可能性を秘めています。その結果、DRLは、AIの安全で責任ある使用に関する議論を形作る多くの懸念をもたらし、悪化させ、AIの影響のより一般的な処理では見落とされる可能性があります。社会にとって最も差し迫った課題はAI研究の進歩の正確な性質に依存するため、将来のAIシステムの課題に備えるために、AIの進歩とAIガバナンスに取り組むグループ間の強力なコラボレーションを構築および維持することが重要です。

by Google翻訳

Today's notes turned out thin again.

Back to A. Géron's textbook.

Policy Search

The algorithm a software agent uses to determine its actions is called its policy. The policy could be a neural network taking observations as inputs and outputting the action to take (see Figure 18-2).

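Figure 18-2. Reinforcement Learning using a neural network policy: I had looked at this figure many times, but today I think I finally understood what the diagram means.

An Agent box is drawn on the left and an Environment box on the right. The arrow from the Agent box to the Environment box is Actions, and the arrow from Environment to Agent is Rewards and Observations. It is a familiar diagram, except that the Agent box contains a sketch of a humanoid robot with a cloud-shaped thought bubble rising from its head; inside the bubble is a sketch of a neural network whose input nodes are connected to Rewards and Observations and whose output nodes are connected to Actions.

After this, the explanation of the policy continues for about a page, with figures. The agent is a robotic vacuum cleaner, and terms such as stochastic policy, policy space, and genetic algorithms are briefly explained. The text then mentions optimization using the gradients of the rewards with respect to the policy parameters, says that a well-known policy gradient (PG) algorithm will be implemented in TensorFlow, and moves on to introducing OpenAI Gym, which is needed to create the environment the agent lives in.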

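May 23 (Sun)

Bristol-Myers Squibb: 772 teams, 11 days to go

The top score is now 0.60 and the leading team has changed. 15 teams are at 1.00 or below.

Deep Reinforcement Learning and physical chemistry

Looking for applications to physics and chemistry, the first thing that caught my eye was the following paper.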

Structure prediction of surface reconstructions by deep reinforcement learning
Søren A Meldgaard, Henrik L Mortensen, Mathias S Jørgensen and Bjørk Hammer

Published 8 July 2020 • © 2020 IOP Publishing Ltd
Journal of Physics: Condensed Matter, Volume 32, Number 40

Abstract
We demonstrate how image recognition and reinforcement learning combined may be used to determine the atomistic structure of reconstructed crystalline surfaces. A deep neural network represents a reinforcement learning agent that obtains training rewards by interacting with an environment. The environment contains a quantum mechanical potential energy evaluator in the form of a density functional theory program. The agent handles the 3D atomistic structure as a series of stacked 2D images and outputs the next atom type to place and the atomic site to occupy. Agents are seen to require 1000–10 000 single point density functional theory evaluations, to learn by themselves how to build the optimal surface reconstructions of anatase TiO2(001)-(1 × 4) and rutile SnO2(110)-(4 × 1).

Gaussian representation for image recognition and reinforcement learning of atomistic structure

Mads-Peter V Christiansen, Henrik Lund Mortensen, Søren Ager Meldgaard, and Bjørk Hammer, J Chem Phys. 2020 Jul 28;153(4):044107

Abstract
The success of applying machine learning to speed up structure search and improve property prediction in computational chemical physics depends critically on the representation chosen for the atomistic structure. In this work, we investigate how different image representations of two planar atomistic structures (ideal graphene and graphene with a grain boundary region) influence the ability of a reinforcement learning algorithm [the Atomistic Structure Learning Algorithm (ASLA)] to identify the structures from no prior knowledge while interacting with an electronic structure program. Compared to a one-hot encoding, we find a radial Gaussian broadening of the atomic position to be beneficial for the reinforcement learning process, which may even identify the Gaussians with the most favorable broadening hyperparameters during the structural search. Providing further image representations with angular information inspired by the smooth overlap of atomic positions method, however, is not found to cause further speedup of ASLA.

Predictive Synthesis of Quantum Materials by Probabilistic Reinforcement Learning
Pankaj Rajak, Aravind Krishnamoorthy, Ankit Mishra, Rajiv Kalia, Aiichiro Nakano and Priya Vashishta, arXiv.org > cond-mat > arXiv:2009.06739v1, [Submitted on 14 Sep 2020]

Abstract
Predictive materials synthesis is the primary bottleneck in realizing new functional and quantum materials. Strategies for synthesis of promising materials are currently identified by time consuming trial and error approaches and there are no known predictive schemes to design synthesis parameters for new materials. We use reinforcement learning to predict optimal synthesis schedules, i.e. a time-sequence of reaction conditions like temperatures and reactant concentrations, for the synthesis of a prototypical quantum material, semiconducting monolayer MoS2, using chemical vapor deposition. The predictive reinforcement leaning agent is coupled to a deep generative model to capture the crystallinity and phase-composition of synthesized MoS2 during CVD synthesis as a function of time-dependent synthesis conditions. This model, trained on 10000 computational synthesis simulations, successfully learned threshold temperatures and chemical potentials for the onset of chemical reactions and predicted new synthesis schedules for producing well-sulfidized crystalline and phase-pure MoS2, which were validated by computational synthesis simulations. The model can be extended to predict profiles for synthesis of complex structures including multi-phase heterostructures and can also predict long-time behavior of reacting systems, far beyond the domain of the MD simulations used to train the model, making these predictions directly relevant to experimental synthesis.

Learning to grow: control of material self-assembly using evolutionary reinforcement learning
Stephen Whitelam and Isaac Tamblyn, arXiv:1912.08333v3 [cond-mat.stat-mech] 28 May 2020

We show that neural networks trained by evolutionary reinforcement learning can enact efficient molecular self-assembly protocols. Presented with molecular simulation trajectories, networks learn to change temperature and chemical potential in order to promote the assembly of desired structures or choose between competing polymorphs. In the first case, networks reproduce in a qualitative sense the results of previously-known protocols, but faster and with higher fidelity; in the second case they identify strategies previously unknown, from which we can extract physical insight. Networks that take as input the elapsed time of the simulation or microscopic information from the system are both effective, the latter more so. The evolutionary scheme we have used is simple to implement and can be applied to a broad range of examples of experimental self-assembly, whether or not one can monitor the experiment as it proceeds. Our results have been achieved with no human input beyond the specification of which order parameter to promote, pointing the way to the design of synthesis protocols by artificial intelligence.

Generative Adversarial Networks for Crystal Structure Prediction
Sungwon Kim, Juhwan Noh, Geun Ho Gu, Alan Aspuru-Guzik, and Yousung Jung ACS Cent. Sci. 2020, 6, 1412−1420

ABSTRACT: The constant demand for novel functional materials calls for efficient strategies to accelerate the materials discovery, and crystal structure prediction is one of the most fundamental tasks along that direction. In addressing this challenge, generative models can offer new opportunities since they allow for the continuous navigation of chemical space via latent spaces. In this work, we employ a crystal representation that is inversion-free based on unit cell and fractional atomic coordinates and build a generative adversarial network for crystal structures. The proposed model is applied to generate the Mg−Mn−O ternary materials with the theoretical evaluation of their photoanode properties for high-throughput virtual screening (HTVS). The proposed generative HTVS framework predicts 23 new crystal structures with reasonable calculated stability and band gap. These findings suggest that the generative model can be an effective way to explore hidden portions of the chemical space, an area that is usually unreachable when conventional substitution-based discovery is employed.

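Having looked over several papers, I can see that Deep Reinforcement Learning (DRL) is playing a very important role in the natural sciences as well, and is spreading rapidly.

Let me go back to A. Géron's textbook and learn DRL.

May 24 (Mon)

Bristol-Myers Squibb: 781 teams, 10 days to go

I browsed the threads of the GM with the Keroppi avatar.

I can only admire the ability to gather, condense, and reorganize relevant information.

Reinforcement Learning: studying with A. Géron's textbook

Introduction to OpenAI Gym: the OpenAI Gym I had been avoiding without trying it

Training an agent requires a working environment. OpenAI Gym provides simulated environments for this (Atari games, board games, 2D and 3D physical simulations, and so on).

The textbook first explains how to install OpenAI Gym.

It then continues its explanation while running CartPole.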

>>> import gym
>>> env = gym.make("CartPole-v1")
>>> obs = env.reset()  # note: gym >= 0.26 and gymnasium return (obs, info) here
>>> obs
array([-0.01258566, -0.00156614, 0.04207708, -0.00180545])

CartPole appears to be a 2D physical simulation.

obs is a 1D NumPy array of four numbers, in this order: obs[0]: cart's horizontal position (0.0 = center), obs[1]: its velocity (positive means right), obs[2]: the angle of the pole (0.0 = vertical), and obs[3]: its angular velocity (positive means clockwise).
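The four-element layout is easy to mix up, so here is a small sketch that gives each field a name. The `CartPoleObs` tuple and the `describe` helper are hypothetical names introduced only for illustration; the field meanings and sign conventions are the ones listed above, and the sample values are the observation printed earlier.

```python
from typing import NamedTuple, Sequence


class CartPoleObs(NamedTuple):
    cart_position: float          # obs[0], 0.0 = center
    cart_velocity: float          # obs[1], positive = moving right
    pole_angle: float             # obs[2], 0.0 = vertical, positive = leaning right
    pole_angular_velocity: float  # obs[3], positive = clockwise


def describe(obs: Sequence[float]) -> str:
    """Turn a raw observation into a short human-readable summary."""
    o = CartPoleObs(*obs)
    direction = "right" if o.cart_velocity > 0 else "left"
    side = "right" if o.pole_angle > 0 else "left"
    return f"cart moving {direction}, pole leaning {side}"


print(describe([-0.01258566, -0.00156614, 0.04207708, -0.00180545]))
```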

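CartPole is not the least bit interesting, so it is hard to get into it. I am reading the textbook reluctantly.

>>> env.action_space
Discrete(2)

There are two actions, the integers 0 and 1: 0 accelerates left, 1 accelerates right.

>>> action = 1  # accelerate right
>>> obs, reward, done, info = env.step(action)  # note: gym >= 0.26 returns five values (obs, reward, terminated, truncated, info)
>>> obs
array([-0.01261699, 0.19292789, 0.04204097, -0.28092127])
>>> reward
1.0
>>> done
False
>>> info
{}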

The step() method executes the given action and returns four values:

obs:

This is the new observation. The cart is now moving toward the right (obs[1] > 0). The pole is still tilted toward the right (obs[2] > 0), but its angular velocity is now negative (obs[3] < 0), so it will likely be tilted toward the left after the next step.

reward:

In this environment, you get a reward of 1.0 at every step, no matter what you do, so that the goal is to keep the episode running as long as possible.

done:

This value will be True when the episode is over. This will happen when the pole tilts too much, or goes off the screen, or after 200 steps (in this last case, you have won). After that, the environment must be reset before it can be used again.

info:

This environment-specific dictionary can provide some extra information that you may find useful for debugging or for training. For example, in some games it may indicate how many lives the agent has.

First, a simple policy: if the pole is leaning to the right, accelerate the cart to the right; if the pole is leaning to the left, accelerate the cart to the left.

def basic_policy(obs):
    angle = obs[2]  # angle of the pole
    return 0 if angle < 0 else 1  # 0 = push left, 1 = push right

totals = []
for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

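Running this gave the following result.

>>> import numpy as np
>>> np.mean(totals), np.std(totals), np.min(totals), np.max(totals)
(41.718, 8.858356280936096, 24.0, 68.0)

As long as the pole has not fallen (the pole's tilt is below a set angle and the cart is still on screen), the reward is 1.0 per step. So the result says that the pole was kept up for 41.7 steps on average, with a standard deviation of 8.9, a minimum of 24 steps, and a maximum of 68 steps.

In other words, even in the best of the 500 episodes the pole was never kept up for more than 68 steps. The text goes on to show that a neural network can improve the policy.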
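Because the Gym API has changed between versions, it can be handy to try the same hard-coded policy without Gym at all. The following is a minimal sketch under stated assumptions: the physical constants, the 12-degree failure angle, the 2.4 track limit, and the Euler integration follow the classic cart-pole formulation as I understand it, and are not copied from Gym's source. The numbers it prints will therefore not match the textbook's exactly.

```python
import math
import random
import statistics

# Assumed constants (classic cart-pole formulation, not Gym's code)
GRAVITY, CART_MASS, POLE_MASS = 9.8, 1.0, 0.1
HALF_POLE_LENGTH, FORCE, DT = 0.5, 10.0, 0.02
MAX_ANGLE = math.radians(12)  # assumed failure angle
MAX_X = 2.4                   # assumed track half-width


def step(state, action):
    """Advance one Euler step; action 0 pushes left, 1 pushes right."""
    x, x_dot, theta, theta_dot = state
    force = FORCE if action == 1 else -FORCE
    total_mass = CART_MASS + POLE_MASS
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + POLE_MASS * HALF_POLE_LENGTH * theta_dot**2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_POLE_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t**2 / total_mass))
    x_acc = temp - POLE_MASS * HALF_POLE_LENGTH * theta_acc * cos_t / total_mass
    state = (x + DT * x_dot, x_dot + DT * x_acc,
             theta + DT * theta_dot, theta_dot + DT * theta_acc)
    done = abs(state[0]) > MAX_X or abs(state[2]) > MAX_ANGLE
    return state, 1.0, done  # reward is 1.0 for every step survived


def basic_policy(obs):
    return 0 if obs[2] < 0 else 1  # push toward the side the pole leans


def run_episode(rng, max_steps=200):
    state = tuple(rng.uniform(-0.05, 0.05) for _ in range(4))
    total = 0.0
    for _ in range(max_steps):
        state, reward, done = step(state, basic_policy(state))
        total += reward
        if done:
            break
    return total


rng = random.Random(42)
totals = [run_episode(rng) for _ in range(500)]
print(statistics.mean(totals), statistics.pstdev(totals), min(totals), max(totals))
```

`statistics.pstdev` is used because `np.std` computes the population standard deviation by default.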

Neural Network Policies

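At last the neural network appears. The observation is fed into the network, which outputs the action to take. More precisely, the output is a probability for each action, and the action is then chosen at random according to that probability.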
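Choosing an action "at random according to the probability" can be sketched in a few lines. Here `p_left` stands in for the network's output (the probability of action 0), and `sample_action` is a hypothetical helper name; no neural network is involved in this sketch.

```python
import random


def sample_action(p_left: float, rng: random.Random) -> int:
    """Return 0 (left) with probability p_left, otherwise 1 (right)."""
    return 0 if rng.random() < p_left else 1


rng = random.Random(0)
actions = [sample_action(0.7, rng) for _ in range(10_000)]
print(actions.count(0) / len(actions))  # fraction of "left" actions, close to 0.7
```

Sampling, rather than always taking the most probable action, lets the agent keep trying the less likely action now and then, which is how it balances exploring new actions against exploiting ones it already believes are good.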

May 25 (Tue)

Bristol-Myers Squibb: 782 teams, 9 days to go

Today I again browsed the threads of the GM with the Keroppi avatar. Most of the terms used were unfamiliar to me, so I looked a few of them up.

Fairseq is a sequence modeling toolkit written in PyTorch that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.

Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code.

pre-norm activation transformer:

Transformers without Tears: Improving the Normalization of Self-Attention

Toan Q. Nguyen and Julian Salazar, arXiv:1910.05895v2 [cs.CL] 30 Dec 2019

Abstract
We evaluate three simple, normalization-centric changes to improve Transformer training. First, we show that pre-norm residual connections (PRENORM) and smaller initializations enable warmup-free, validation-based training with large learning rates. Second, we propose ℓ2 normalization with a single scale parameter (SCALENORM) for faster training and better performance. Finally, we reaffirm the effectiveness of normalizing word embeddings to a fixed length (FIXNORM). On five low-resource translation pairs from TED Talks-based corpora, these changes always converge, giving an average +1.1 BLEU over state-of-the-art bilingual baselines and a new 32.8 BLEU on IWSLT '15 English-Vietnamese. We observe sharper performance curves, more consistent gradient norms, and a linear relationship between activation scaling and decoder depth. Surprisingly, in the high-resource setting (WMT '14 English-German), SCALENORM and FIXNORM remain competitive but PRENORM degrades performance.

Transformerトレーニングを改善するために、3つの単純な正規化中心の変更を評価します。まず、ノルム前の残余接続（PRENORM）と小さな初期化により、ウォームアップのない検証ベースのトレーニングが大きな学習率で可能になることを示します。次に、トレーニングを高速化し、パフォーマンスを向上させるために、単一のスケールパラメータ（SCALENORM）を使用した `2正規化を提案します。最後に、単語の埋め込みを固定長（FIXNORM）に正規化することの有効性を再確認します。 TED Talksベースのコーパスからの5つの低リソース翻訳ペアでは、これらの変更は常に収束し、最先端のバイリンガルベースラインを平均+1.1 BLEU、IWSLT '15EnglishVietnameseで新しい32.8BLEUを提供します。よりシャープなパフォーマンス曲線、より一貫性のある勾配基準、およびアクティベーションスケーリングとデコーダー深度の間の線形関係を観察します。驚いたことに、高リソース設定（WMT '14英語-ドイツ語）では、SCALENORMとFIXNORMは引き続き競争力がありますが、PRENORMはパフォーマンスを低下させます。by Google翻訳
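To pin down what "pre-norm" and SCALENORM mean, here is a minimal pure-Python sketch on plain lists. It is only an illustration of the two ideas named in the abstract: the `double` function is a stand-in for an attention or feed-forward sublayer, the layer norm has no learned gain or bias, and none of this is the authors' code.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def scale_norm(x, g, eps=1e-6):
    """SCALENORM-style l2 normalization with a single scalar g."""
    norm = math.sqrt(sum(v * v for v in x))
    return [g * v / max(norm, eps) for v in x]


def post_norm_block(x, sublayer):
    """Post-norm: add the residual first, then normalize the sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x, sublayer):
    """Pre-norm: normalize only the sublayer input; the residual path is untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def double(v):
    return [2.0 * t for t in v]  # stand-in for attention / feed-forward


x = [1.0, 2.0, 3.0, 4.0]
print(pre_norm_block(x, double))
print(post_norm_block(x, double))
print(scale_norm([3.0, 4.0], g=1.0))
```

In the pre-norm block the input `x` passes through to the output unchanged apart from the added sublayer term, which is the property usually credited with making deep stacks easier to train.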

The next two papers are background knowledge.

Show and Tell: A Neural Image Caption Generator
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan
arXiv:1411.4555v2 [cs.CV] 20 Apr 2015

Abstract
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.

画像の内容を自動的に記述することは、コンピュータービジョンと自然言語処理を結び付ける人工知能の基本的な問題です。この論文では、コンピュータビジョンと機械翻訳の最近の進歩を組み合わせ、画像を説明する自然な文を生成するために使用できる、ディープリカレントアーキテクチャに基づく生成モデルを提示します。モデルは、トレーニング画像が与えられた場合にターゲットの説明文の可能性を最大化するようにトレーニングされます。いくつかのデータセットでの実験は、モデルの正確さと、画像の説明からのみ学習する言語の流暢さを示しています。私たちのモデルはしばしば非常に正確であり、定性的および定量的に検証します。たとえば、Pascalデータセットの現在の最先端のBLEU-1スコア（高いほど良い）は25ですが、私たちのアプローチでは59が得られ、69前後の人間のパフォーマンスと比較されます。BLEU-1も示しています。 Flickr30kのスコアが56から66に、SBUのスコアが19から28に向上しました。最後に、新しくリリースされたCOCOデータセットで、現在の最先端のBLEU-4である27.7を達成しました。by Google翻訳

f:id:AI_ML_DL:20210525120808p:plain

f:id:AI_ML_DL:20210525120945p:plain

f:id:AI_ML_DL:20210525121037p:plain

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu et al., arXiv:1502.03044v3 [cs.LG] 19 Apr 2016

Abstract
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

機械翻訳とオブジェクト検出の最近の研究に触発されて、画像の内容を説明することを自動的に学習する注意ベースのモデルを紹介します。標準的なバックプロパゲーション手法を使用して、変分下限を最大化することにより確率的にこのモデルを決定論的にトレーニングする方法について説明します。また、視覚化を通じて、出力シーケンスで対応する単語を生成しながら、モデルが顕著なオブジェクトを注視することを自動的に学習する方法を示します。 Flickr8k、Flickr30k、MS COCOの3つのベンチマークデータセットで、最新のパフォーマンスを使用してattentionの使用を検証します。　　　　　　by Google翻訳

f:id:AI_ML_DL:20210525122337p:plain

f:id:AI_ML_DL:20210525122541p:plain

Reinforcement Learning
Neural Network Policies

An example of building a neural network policy with tf.keras:

import tensorflow as tf
from tensorflow import keras

n_inputs = 4  # == env.observation_space.shape[0]

model = keras.models.Sequential([
    # hidden layer: 5 units are enough for this simple task
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    # output layer: probability of action 0 (left)
    keras.layers.Dense(1, activation="sigmoid"),
])

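The input layer has 4 units. The inputs are the observations: position, velocity, angle, and angular velocity.

The hidden layer has 5 units because the task is simple. Its activation function is "elu".

The output is only the probability of action 0 (left), so there is 1 output unit. Because the output is a probability, the activation function is "sigmoid".

That completes the neural network policy: a network with 4 inputs, 5 hidden units, and 1 output.

It is important to recognize how this differs from the earlier hardcoded policy. The hardcoded policy chose the action from the sign of the angle (the tilt of the pole). The neural network policy instead chooses the action from a probability estimated by a nonlinear function of four values: the cart position (0.0 is the center of the screen), the cart velocity (positive to the right), the pole angle (0.0 is vertical, positive when tilted clockwise, i.e. to the right), and the angular velocity (roughly how fast the pole is falling, positive clockwise).

How is it trained?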

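Evaluating Actions: The Credit Assignment Problem

Ordinary supervised learning is not possible here. With the hardcoded program shown earlier, some episodes kept the pole up for more steps than average and others for fewer, but what can be learned from that, and how? The program only accelerates the cart in the direction the pole is leaning, so how would a better policy be found?

The text introduces a discount factor and runs a thought experiment with it. It appears to control how many future rewards are summed and how strongly they decay. With a discount factor of 0.95, a reward 13 steps ahead counts for about half; with a discount factor of 0.99, a reward 69 steps ahead counts for about half.

Next, if an action advantage is defined by comparing the proportions in which good and bad actions occur, each action can be classified as having a positive advantage or a negative advantage. That apparently completes the evaluation of the actions.

My understanding is insufficient, so these sentences do not connect and are hard to follow. To be revisited.

I cannot understand "Figure 18-6. Computing an action's return: the sum of discounted future rewards", so I am stopping here.

The section closes as follows.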

Perfect-now that we have a way to evaluate each action, we are ready to train our first agent using policy gradients. Let's see how.

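May 26 (Wed)

Bristol-Myers Squibb: 791 teams, 8 days to go

Today I will give priority to studying Deep Reinforcement Learning.

Deep Reinforcement Learning

Evaluating Actions: The Credit Assignment Problem:

Let me reread this section from the first line.

Translating it into Japanese does not help me understand it, so I will copy excerpts from the text. It is likely to end up as the full text rather than excerpts, because skipping even one line might make it unintelligible.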

If we knew what the best action was at each step, we could train the neural network as usual, by minimizing the cross entropy between the estimated probability distribution and the target probability distribution.

It would just be regular supervised learning.

However, in Reinforcement Learning the only guidance the agent gets is through rewards, and rewards are typically sparse and delayed.

For example, if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good, and which of them were bad?

All it knows is that the pole fell after the last action, but surely this last action is not entirely responsible.

This is called the credit assignment problem: when the agent gets a reward, it is hard for it to know which actions should get credited (or blamed) for it.

Think of a dog that gets rewarded hours after it behaved well: will it understand what it is being rewarded for?

The cart is moved left and right so that the pole on it does not fall. It is a two-dimensional street performance.

To tackle this problem, a common strategy is to evaluate an action based on the sum of all the rewards that come after it, usually applying a discount factor γ (gamma) at each step. (I do not yet understand the meaning or role of the discount factor.)

This sum of discounted rewards is called the action's return. (Here "action" seems to refer not to a single step but to a whole sequence of steps.)

Consider the example in Figure 18-6.

If an agent decides to go right three times in a row and gets +10 reward after the first step, 0 after the second step, and finally -50 after the third step, then assuming we use a discount factor γ = 0.8, the first action will have a return of 10 + γ × 0 + γ^2 × (-50) = -22.
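The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch: the function name `discounted_returns` is my own choice, not taken from the textbook, and the rewards and discount factor are the ones quoted above.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_(t+1) + gamma^2 * r_(t+2) + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the return of the following step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Rewards from the quoted example: +10, 0, -50 with a discount factor of 0.8.
returns = discounted_returns([10, 0, -50], gamma=0.8)
print(returns)
```

The first entry is the return of the first action (10 + 0.8 × 0 + 0.64 × (-50) = -22), and the later entries are the returns of the second and third actions. Each action gets its own return, computed from the rewards that follow it.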

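This is where I get confused. The description of the reward on p. 616 says that a reward of 1.0 is received at every step. Here, however, the agent steps right three times and receives a reward of +10 from the first step, 0 from the second step, and -50 from the third step. I need to think this over.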

5月27日（木）

Bristol-Myers Squibb：799 teams, 7 days to go

あと1週間になった。終わってから猛勉強だな。

画像からテキストへの変換だが、分子構造画像からテキスト、図形からテキスト、数式からテキスト、関数からテキスト、方程式からテキスト、物理の理論式からテキスト、化学反応式からテキスト、タンパクの構造式からテキスト、・・・

人の五感に相当するセンサーを備え、自律的に学習し、行動する人工人間を、デザインしてみよう。目標はナンバーファイブだ。

WIKIPEDIA: Short Circuit (1986 film)

Reinforcement Learning：A. Geronさんのテキスト第18章: Reinforcement Learning

If the discount factor is close to 0, then future rewards won't count for much compared to immediate rewards.　

Conversely, if the discount factor is close to 1, then rewards far into the future will count almost as much as immediate rewards.　

Typical discount factors vary from 0.9 to 0.99.

With a discount factor of 0.95, rewards 13 steps into the future count roughly for half as much as immediate rewards (since 0.95^13≒0.5), while with a discount factor of 0.99, rewards 69 steps into the future count for half as much as immediate rewards.　
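The "half as much" figures can be reproduced numerically. The sketch below is my own check, not code from the textbook; `half_life_steps` is a hypothetical helper that solves gamma^n = 0.5 for n.

```python
import math


def half_life_steps(gamma):
    """Number of steps n after which gamma**n equals 0.5 (not rounded)."""
    return math.log(0.5) / math.log(gamma)


for gamma, steps in ((0.95, 13), (0.99, 69)):
    weight = gamma ** steps          # weight of a reward `steps` steps ahead
    exact = half_life_steps(gamma)   # where the weight is exactly one half
    print(f"gamma={gamma}: gamma^{steps}={weight:.3f}, half-life={exact:.1f} steps")
```

Both weights come out close to 0.5, so the discount factor can be read as setting the time horizon over which future rewards still matter.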

In the CartPole environment, actions have fairly short-term effects, so choosing a discount factor of 0.95 seems reasonable.　

Figure 18-6. Computing an action's return: the sum of discounted future rewards

Of course, a good action may be followed by several bad actions that cause the pole to fall quickly, resulting in the good action getting a low return (similarly, a good actor may sometimes star in a terrible movie).

However, if we play the game enough times, on average good actions will get a higher return than bad ones.

We want to estimate how much better or worse an action is, compared to the other possible actions, on average.

This is called the action advantage.

For this, we must run many episodes and normalize all the action returns (by subtracting the mean and dividing by the standard deviation).

After that, we can reasonably assume that actions with a negative advantage were bad while actions with a positive advantage were good.
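The normalization step described here (subtract the mean, divide by the standard deviation, over all episodes) can be sketched as follows. This is an illustration under my own assumptions: the function names and the two toy episodes are made up, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def discounted_returns(rewards, gamma):
    """Discounted return of every step in one episode."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def action_advantages(all_rewards, gamma):
    """Normalize the returns of all episodes with one shared mean and std."""
    all_returns = [discounted_returns(r, gamma) for r in all_rewards]
    flat = [g for episode in all_returns for g in episode]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(g - mu) / sigma for g in episode] for episode in all_returns]


# Two toy episodes (hypothetical rewards), discount factor 0.8.
advantages = action_advantages([[10, 0, -50], [10, 20]], gamma=0.8)
for episode in advantages:
    print([round(a, 2) for a in episode])
```

Actions whose normalized value is positive did better than average and are treated as good; those with a negative value are treated as bad.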

Perfect - now that we have a way to evaluate each action, we are ready to train our first agent using policy gradients.

Let's see how.

Policy Gradients

As discussed earlier, PG algorithms optimize the parameters of a policy by following the gradients toward higher rewards.

One popular class of PG algorithms, called REINFORCE algorithms, was introduced back in 1992 (https://homl.info/132) by Ronald Williams.

Here is one common variant:

5月28日（金）

Bristol-Myers Squibb：806 teams, 6 days to go

進捗なし。

5月29日（土）

Bristol-Myers Squibb：816 teams, 6 days to go

進捗なし。

5月31日（月）

Bristol-Myers Squibb：831 teams, 4 days to go

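I had been working mainly on Bristol-Myers Squibb, but it did not go well, and after switching partway through to studying Reinforcement Learning I lost track of what I was doing.

I will run the public train code and tune it, to understand the code at least a little better.

Before running the public code I wanted to understand Attention and the Transformer properly and was writing a post about them, but the browser reset partway through and everything I had written was lost.

Let me quote a little from Chapter 8 (Attention) of 斎藤康毅, ゼロから作るDeep Learning 2 自然言語処理編.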

8.5.2 Transformer

　私たちはこれまで、RNN（LSTM）をいたるところで使用してきました。言語モデルに始まり、文章生成、seq2seq、そしてAttection付きseq2seqと、その構成要素には必ずRNNが登場しました。そして、このRNNによって、可変長の時系列データはうまく処理され、（多くの場合）良い結果を得ることができます。しかし、RNNにも欠点があります。その欠点のひとつに、並列化処理が挙げられます。

　RNNは、前時刻に計算した結果を用いて逐次的に計算を行います。そのため、RNNの計算を、時間方向で並列的に計算することは（基本的には）できません。この点は、ディープラーニングの計算がGPUを使った並列計算の環境で行われることを想定すると、大きなボトルネックになります。そこで、RNNを避けたいというモチベーションが生まれます。

　そのような背景から、現在ではRNNを取り除く研究（もしくは並列計算可能なRNNの研究）が活発に行われています。これは『Attention is all you need』というタイトルの論文で提案された手法です。そのタイトルが示すとおり、RNNではなくAttentionを使って処理します。ここでは、このTransformerについて簡単に見ていきます。

　TransformerはAttentionによって構成されますが、その中でもSelf-Attentionというテクニックが利用される点が重要なポイントです。このSelf- Attentionは直訳すれば「自分自身に対してのAttention」ということになります。つまりこれは、ひとつの時系列データを対象としたAttentionであり、ひとつの時系列データ内において各要素が他の要素に対してどのような関連性があるのかを見ていこうというものです。私たちのTime Attentionレイヤを使って説明すると、Self-Attentionは図8-37のように書けます。

　これまで私たちは「翻訳」のような2つの時系列データ間の対応関係をAttentionで求めてきました。このときTime Attentionレイヤへの2本の入力には、図8-37の左図で示すように、異なる2つの時系列データ（hs_enc, hs_dec）が入力されます。これに対してSelf-Attentionは、図8-37の右図に示すように、2本の入力線にひとつの時系列データ（hs）が入力されます。そうすることで、ひとつの時系列データ内において各要素間の対応関係が求められます。

　Self-Attentionの説明が済んだので、続いてTransformerのレイヤ構成を見ていただきましょう。Transformerの構成は図8-38のようになります。

　Transformerでは、RNNの代わりにAttentionが使われます。実際に図8-38を見ると、EncoderとDecoderの両者でSelf-Attentionが使われていることがわかります。なお、図8-38のFeed Forwardレイヤは、フィードフォワードのネットワーク（時間方向に独立して処理するネットワーク）を表します。具体的には、隠れ層が1層で活性化関数にReLUを用いた全結合のニューラルネットワークが用いられています。また、図中に「Nx」とありますが、これは背景がグレーで囲まれた要素をN回積み重ねることを意味します。

　このTransformerを用いることで、計算量を抑え、GPUによる並列計算の恩恵をより多く受けることができます。その成果として、TransformerはGNMTに比べて学習時間を大幅に減らすことに成功しました。さらに翻訳精度の点でも、図8-39が示すように、精度向上を実現しました。

　図8-39では、3つの手法が比較されています。その結果は、GNMTよりも「畳み込み層を用いたseq2seq」（図中ではConvS2Sで表記）が精度が高く、さらにTransformerはそれをも上回っています。このように、Attentionは、計算量だけでなく、精度の観点からも有望な技術であることがわかります。

　わたしたちは、これまでAttentionをRNNと組み合わせて利用してきました。しかし、ここでの研究が示唆するように、AttentionはRNNを置き換えるモジュールとしても利用できるのです。これによって、さらにAttentionの利用機会が増えていくかもしれません。　

　以上が、斎藤康毅著ゼロから作るDeep Learning 2 自然言語処理編 8章　AttentionからのTransformerに関する引用。

Bristol-Myers Squibbのtrainの公開コードは、明日走らせてみる予定。

6月1日（火）

Bristol-Myers Squibb：841 teams, 3 days to go

50位のスコアが1.42とは、凄い人達が集まっているんだ。

公開コード（train）を走らせてみる：

2週間くらいKaggleでcodeを走らせていなかったので、少し、とまどった。

いま走らせているコード”Tensorflow TPU Training Baseline LB 16.92”の動作状況とコードを見ていて思うのは、この2週間を無駄に過ごしたということ。2週間前には、このコードや、このコードの作者が参考にしたコードを学ぼうと計画していたのに、コンペの大通りを離れて脇道に入り、reinforcement learning広場で2週間も遊んでしまったことが悔やまれる。

TPUを使うには、データセットをTFRecordに書き換えるところから始める必要がある。

”Advanced Image Cleaning and TFRecord Generation”

画像を綺麗にする（点状のノイズ消去、線の欠けの修正）こと、画像の不要な外側の空白領域を削除すること、および、TFRecordへの変換を行っている。

画像のサイズを決めるのも容易ではない。作成者はつぎのように記述している。

Choosing an appropriate image size is difficult, as complex molecules will need a high resolution to preserve details, but training on 2.4 million in high resolution is infeasible. The chosen resolution of 256*448 should preserve enough detail and allow for training on a TPU within the 3 hours limit.

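The image resolution has to be chosen so that the 2.4 million images can be processed within 3 hours. Looking at the train code, one epoch trains on 512k samples rather than 2.4M. Done naively, this would use only part of the data. According to the Q&A on the public code, it was improved partway through so that all the data can be used for training. This reminded me of the HuBMAP competition, where I could not use all of the train data and my score did not improve. There the train_data had to be loaded before training started, and about 1/3 of it could not be loaded because memory ran out. It is common knowledge that when the train_data is unevenly distributed, prediction accuracy does not improve unless all of it is used, so I should have worked out a countermeasure. For example, loading 2/3 of the train_data takes about 3 minutes, so with 10 epochs of training, loading a different combination of train_data at every epoch would have added only about 30 minutes, all of the training data would have been used, and the score might have improved.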
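The idea of loading a different combination of train_data at every epoch could look roughly like the sketch below. It is not the code I used: the function `epoch_subset`, the file names, and the choice of 3 chunks with 2 loaded per epoch are all assumptions for illustration.

```python
def epoch_subset(files, epoch, n_chunks=3, n_used=2):
    """Pick n_used of n_chunks chunks of files, rotating the choice every epoch."""
    # Interleaved split so each chunk gets a similar mix of files.
    chunks = [files[i::n_chunks] for i in range(n_chunks)]
    picked = [(epoch + k) % n_chunks for k in range(n_used)]
    return sorted(name for c in picked for name in chunks[c])


# Hypothetical file names: 15 training images, about 2/3 fit in memory at once.
files = [f"train_{i:02d}" for i in range(15)]

seen = set()
for epoch in range(10):
    subset = epoch_subset(files, epoch)
    # A real training loop would load `subset` here and train for one epoch.
    seen.update(subset)

print(len(subset), "files per epoch;", len(seen), "of", len(files), "files seen")
```

Every epoch holds only two thirds of the files in memory, yet every file is seen within the first few epochs. The price is the reload time at the start of each epoch.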

画像を綺麗にする（点状のノイズ消去、線の欠けの修正）には、たとえば次のようなコードを用いている。

# single pixel width horizontal line with 1 pixel missing
kernel_h_single_mono = pad_kernel( a, a, a, -1, a, a, a , max_pad=1)

# single pixel width horizontal line with 3 pixels missing
kernel_h_single_triple = pad_kernel( a, a, a, -1, -1, -1, a, a, a , max_pad=1)

kernel_h_multi = pad_kernel([
[ a, a, a, a, a, a, a ],
[ a, a, a,-1, a, a, a ],
[ a, a, a, a, a, a, a ],
], max_pad=1)

こういうのをANNでやりたいと思ったのだが、こういう技術も使えるようになっておくことは重要だと思う。

InChIコードに対して、inchi_intを定義しているようだが、コードを読むことができない。

遊び半分でエポック数を10、20、30と変えて訓練し、訓練済みのパラメータをデータセットに入れて、inferenceプログラムを走らせると、LBスコアは、それぞれ、11.36、8.23、7.85となった。

これ以上どうなるものでもなさそうだが、とりあえず、あと2日間、discussionや他の公開コードを眺めながら、スコアアップの方法を試してみる。

6月2日（水）

Bristol-Myers Squibb：855 teams, 2 days to go

バッチサイズを512から1024にしてみよう。

学習率lrを2e-3～1e-4から1e-3～1e-5にしてみよう。

LBスコアは、9.19と悪化した。

明日は、最終日だが、これで終わる。

このコンペは、失敗であった。

6月4日（金）

先ほど、Bristol-Myers Squibbコンペの最終結果が出た。コードコンペではないので、結果が即日確定！順位変動も殆どなし。

反省の弁

１．コンペに参加する意義は、スコアアップ以外にはない。途中でreinforcement learningに向かったのは、generativeなモデルがスコアアップに必要だと感じたためであるが、reinforcement learningに向かったのは失敗であった。しかも、reinforcement learningの学習中は、集中力を欠いていた。

２．課題が自分の頭では解けなかった（分子構造式とInChIコードの対応関係を最後まで把握できなかった）ために、モデル構築の方向性を見出せず、スコアアップの方向に舵を切ることができなかった。コンペのdiscussionや公開コードをよく見て対策を考えるべきだった。

３．金や銀を意識することは、モチベーションを上げるために重要であるが、今回は、メダル圏内に入ることすら早々にあきらめてしまったために、あらぬ方向に行ってしまい、公開コードのチューニングすら殆ど行わなかった。けろっぴアバターの方が公開していたコードを試すことは、やるべきだったと思う。

f:id:AI_ML_DL:20210510234705p:plain — style=165, iteration=500

f:id:AI_ML_DL:20210510235029p:plain — style=165, iteration=50

f:id:AI_ML_DL:20210510235136p:plain — style=165, iteration=5

2021-03-01

Kaggle散歩（1st March to 10th May 2021）

3月1日（月）

HuBMAP：1,130 teams, two months to go

今日はBatch_sizeの16と32に違いがあるかどうかを調べる。

EfficientNetB5-UnetのLB=0.836のコードがBatch_size=16だったので、これを32に変更したがGPUのメモリーオーバーとなったため、B4で実行した：結果はLB=0.836となった。

これで改善が認められれば他のDecoderでもやってみるつもりだったが、・・・。

次は、Encoderのpre-training weights（'imagenet'を常用している）によって違いがあるかどうかを調べる。

'noisy-student'：LB=0.831：0.005下がった。

'advprop'：LB=0.823：0.013下がった。

こういうのは、実際に使ってみないとわからないものだ。ただし、エポック数など、最適化できていないパラメータに起因して、スコアが低い、という可能性もある。

次は、pre-trained weightsではなく、random initializationでやってみよう。

A. Geron's text introduces Glorot (also called Xavier) initialization, He initialization, and LeCun initialization. The Segmentation Models library I am using does not seem to support these initializations.

random initialization(epochs=30)：LB=0.819 ：意外に健闘しているようにみえる。

今日のデータはすべて30エポックだが、random initializationについては、明日、エポック数を増やしてみよう。

3月2日（火）

HuBMAP：1,142 teams, two months to go

random initialization(epochs=50)：LB=0.778：期待外れ。これ以上追求しない。

（lr_schedulerは、OneCycleLRを使い続けているので、エポック数は、OneCycleLRの設定エポック数である。このスケジューラーは、early_stopで監視する必要が無く、極端なoverfittingにもならないようで、便利だが、Max_lrとエポック数の最適化は必須。合間に最適化を試みているが、まだ使いこなせていないように感じている。）

So far I have set the "activation function to apply after the final convolution layer" to None and evaluated with score > 0. I will now set the activation function to "sigmoid", evaluate with score > threshold, and try to optimize the threshold.
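The two ways of thresholding are closely related: comparing sigmoid(score) with a threshold t is the same as comparing the raw score with logit(t) = log(t / (1 - t)). The sketch below only illustrates that relation; the helper names and the sample scores are my own.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of sigmoid: the raw score whose sigmoid equals p."""
    return math.log(p / (1.0 - p))


# Probability thresholds and the equivalent thresholds on the raw score.
for t in (0.50, 0.45, 0.40):
    print(f"sigmoid(score) > {t:.2f}  <=>  score > {logit(t):+.3f}")

# Check the equivalence on a few hypothetical raw scores.
scores = [-2.0, -0.3, -0.1, 0.1, 1.5]
for t in (0.50, 0.45, 0.40):
    assert [sigmoid(s) > t for s in scores] == [s > logit(t) for s in scores]
```

A threshold of 0.5 on the sigmoid output therefore selects the same pixels as score > 0 on a raw output. Lowering the threshold to 0.45 or 0.40 corresponds to a slightly negative cut on the raw score. Training with a sigmoid head and a different loss can still change the scores themselves.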

ちょっとやってみたが、うまくいかない。何かおかしい。間違ったことをしているかもしれない・・・。

best single modelというタイトルのdiscussionにおいて、LB=0.84+から0.86+くらいのスコアのモデル（encoder+decoder）、optimizer、loss function、epochsなど概要が紹介されている。自分の結果、0.83+とは最大で0.03くらい違う。前処理、後処理などの情報が少ないので何を参考にすればよいのかわからないが、気になるのは、スコアが高い0.85+の4件が、EncoderにEfficientNetを使っていないことと、開発年代の古いEncoderを用いて、エポック数を多くしていることである。さらに、resnet34-UnetでLB=0.857というスコアを得ているのは驚きだ。FPNは1件で、そのほかはUnetを使っている。

Googleが昨年発表した論文"EfficientDet: Scalable and Efficient Object Detection"で提案している新しいモデルEfficientDetは、FPNから派生したもののようである"we propose a weighted bi-directional feature pyramid network (BiFPN)"。

3月3日（水）

HuBMAPに集中している間に、いろいろなコンペが通り過ぎていった。今日、新しいテーマのコンペ案内メールが来ていた。化学系には興味があるので調べてみる。

Bristol-Myers Squibb – Molecular Translation
Can you translate chemical images to text?

The "chemical images" are structural formulas, and the "text" is InChI. The structural formulas are given as images. Wikipedia describes InChI as follows.

InChI(International Chemical Identifier)は、標準的かつ人間が読める方法で分子情報を提供し、またウェブ上でのデータベースからの情報の検索機能を提供する。元々、2000年から2005年にIUPACとNISTによって開発され、フォーマットとアルゴリズムは非営利であり、開発の継続は、IUPACも参画する非営利団体のInChI Trustにより、2010年までサポートされていた。現在の1.04版は、2011年9月にリリースされた。

1.04版の前までは、ソフトウェアはオープンソースのGNU Lesser General Public Licenseで無償で入手できたが[3]、現在は、IUPAC-InChI Trust Licenseと呼ばれる固有のライセンスとなっている[4]。ウイキペディアより引用。

構造式を含む文献を電子化する場合に必要となる技術の1つかもしれないが、InChIでは3次元構造は原理的に表現しきれないようであり、InChI表記は無味乾燥であり、機械学習で推測したInChIに学術的価値がどれだけあるのかも理解できないので、やってみようという気にならない。構造式の画像から近似的に3次元分子構造モデルを再現するということであれば楽しく取り組めそうだが、・・・。

HuBMAP：1,155 teams, two months to go（データ修正のため順延中）

昨日の取り組み「activation functionを"sigmoid"にして、score > thresholdとして、thresholdの最適化をやってみる。」のつづき。

threshold=0.5で、BCELossとFocalLossで10エポックの場合について調べた。

B5-Unet, BCELoss, AdamW, 10 epochs：LB=0.822 (output_file_ size=5.1 MB)

B5-Unet, FocalLoss, AdamW, 10 epochs：LB=0.799 (output_file_ size=4.94 MB)

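In both cases the output file size is 4.9-5.1 MB, smaller than before using sigmoid. When LB=0.836 was obtained without sigmoid, the output size was 5.5-5.7 MB. Output size and LB score are not necessarily proportional, but in this case lowering the threshold may raise the LB score.

The GPU quota is used up, so this check has to wait until Saturday or later.

Besides "sigmoid", the available activation functions include "softmax", "logsoftmax", and "tanh". The output of tanh lies in -1 < tanh < 1, and its derivative at the midpoint (x = 0) is 4 times that of sigmoid, so it may converge faster and be more sensitive than sigmoid. I will try it.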
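The factor of 4 can be confirmed from the analytic derivatives, sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)) and tanh'(x) = 1 - tanh(x)^2. The sketch below is just that check, with helper names of my own choosing.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


ratio = d_tanh(0.0) / d_sigmoid(0.0)
print("sigmoid'(0) =", d_sigmoid(0.0), " tanh'(0) =", d_tanh(0.0), " ratio =", ratio)

# tanh is a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
for x in (-1.5, -0.2, 0.0, 0.7, 3.0):
    assert abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
```

Because tanh(x) = 2·sigmoid(2x) - 1, a threshold of 0.0 on the tanh output plays the same role as a threshold of 0.5 on a sigmoid output.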

3月4日（木）

HuBMAP： 1,167 teams, two months to go（データ更新待ち）

GPUが使えないのでCPUで計算する。：B5はメモリーオーバーになるので、計算時間も考えて、B0を使うことにする。B0-UnetでのLBスコアは0.827前後である。

最初にtanhを試す。

B0-Unet(tanh: threshold=0.0)-E10-B16 : LB=0.827

tanhも使えることがわかった。出力データは5.64 MBで、sigmoidの場合よりも大きい。

LBスコアは、sigmoidを使った場合よりも高くなっている。

次に、sigmoidで、thresholdの効果を試す。1サイクルに6時間くらいかかる（GPUなら60分くらい）ので、以下の3つの計算の結果が出そろうのは明日になる。

B0-Unet(sigmoid: threshold=0.50)-E10-B16: LB=0.819（このスコアで閾値を調整しても、とは思うのだが、何事も経験しておくことは大切だ、ということで、・・・。）

B0-Unet(sigmoid: threshold=0.45)-E10-B16: LB=0.820

B0-Unet(sigmoid: threshold=0.40)-E10-B16: LB=0.821

threshold依存性はありそうだ。しかし、このスコアでは、使う気にならない。

semantic segmentationについてもっと勉強しよう。

Deep Semantic Segmentation of Natural and Medical Images: A Review
Saeid Asgari Taghanaki, Kumar Abhishek, Joseph Paul Cohen, Julien Cohen-Adad and Ghassan Hamarneh, arXiv:1910.07655v3 [cs.CV] 3 Jun 2020,

f:id:AI_ML_DL:20210305000412p:plain

f:id:AI_ML_DL:20210305000510p:plain

Cross Entropy --> Weighted Cross Entropy --> Focal Loss (Lin et al. (2017b) added the term (1 − p̂)^γ to the cross entropy loss)

Overlap Measure based Loss Functions: Dice Loss / F1 Score --> Tversky Loss (Tversky loss (TL) (Salehi et al., 2017) is a generalization of the Dice loss (DL)) --> Exponential Logarithmic Loss --> Lovasz-Softmax loss (a smooth extension of the discrete Jaccard loss (IoU loss))

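After explaining the various loss functions above, the review concludes as follows.

The loss functions which use cross-entropy as the base and the overlap measure functions as a weighted regularizer show more stability during training.

The recommendation is to use a cross-entropy-type loss as the base and add an overlap-type loss to it. The cross-entropy and focal curves on the left of Fig. 12 appear to have complementary characteristics. Does that mean combining those two would also help?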
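To make the recommendation concrete, here is a minimal pure-Python sketch of a cross-entropy base with a weighted Dice term, plus the focal variant of cross-entropy. It works on flat lists of per-pixel probabilities and is not the implementation of any particular library; the function names, the smoothing constant, and the weight of 0.5 are assumptions.

```python
import math


def _clip(p, eps=1e-7):
    return min(max(p, eps), 1.0 - eps)


def bce_loss(probs, targets):
    """Mean binary cross-entropy over pixels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = _clip(p)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def focal_loss(probs, targets, gamma=2.0):
    """Cross-entropy with each pixel down-weighted by (1 - p_t)^gamma."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = _clip(p)
        p_t = p if y == 1 else 1.0 - p   # probability given to the true class
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """1 - soft Dice coefficient (an overlap measure)."""
    inter = sum(p * y for p, y in zip(probs, targets))
    return 1.0 - (2.0 * inter + smooth) / (sum(probs) + sum(targets) + smooth)


def combined_loss(probs, targets, dice_weight=0.5):
    """Cross-entropy base plus a weighted overlap term."""
    return bce_loss(probs, targets) + dice_weight * dice_loss(probs, targets)


probs, targets = [0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0]   # hypothetical pixels
print(bce_loss(probs, targets), focal_loss(probs, targets),
      dice_loss(probs, targets), combined_loss(probs, targets))
```

With gamma = 0 the focal loss reduces to plain cross-entropy, and for gamma > 0 it is never larger than it, because well-classified pixels are down-weighted.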

3月5日（金）

HuBMAP：1,176 teams, two months to go（データ更新待ち）

シングルモデルでLB=0.85+となることを目標に、DeepLabV3+に集中してみよう。

DeepLabV3+の性能を発揮させるためには、パラメータを適正に設定しなければならない筈なので、論文を丁寧に読んで、モデルの中身を理解しなければならない。

ここまで、Unet以外の8種類のモデルのどれもが、Unetのスコアを超えられないのは、各モデルのパラメータを適切に設定することができていないからなのだろうと思う。

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam
Google Inc. ECCV 2018

Abstract.

Spatial pyramid pooling module or encoder-decoder structure are used in deep neural networks for semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information.

In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.

空間ピラミッドプーリングモジュールまたはエンコードデコーダー構造は、セマンティックセグメンテーションタスクのディープニューラルネットワークで使用されます。前者のネットワークは、フィルターまたは複数のレートと複数の有効視野でのプール操作を使用して着信機能をプローブすることにより、マルチスケールのコンテキスト情報をエンコードできます。後者のネットワークは、空間情報を徐々に回復することにより、より鋭いオブジェクト境界をキャプチャできます。

この作業では、両方の方法の利点を組み合わせることを提案します。具体的には、提案されたモデルであるDeepLabv3 +は、シンプルでありながら効果的なデコーダーモジュールを追加することで、DeepLabv3を拡張し、特にオブジェクトの境界に沿ってセグメンテーション結果を改善します。 Xceptionモデルをさらに調査し、深さ方向に分離可能な畳み込みをAtrous Spatial Pyramid Poolingとデコーダーモジュールの両方に適用して、より高速で強力なエンコーダー-デコーダーネットワークを実現します。

以上、google翻訳

3月6日（土）

GPU: 43h

HuBMAP：1,182 teams, two months to go（データ更新待ち）

HuBMAPコンペは、糸球体を細胞群から抽出するという単純な作業なんだが、糸球体の周辺/背景は、似たような細胞で満たされている。内部と外部は類似しているので、境界層の識別能力を高めることが重要なのだろうと思うのだが、どうやったら実現できるのか。上記のabstractに、DeepLabV3+は境界層の識別能力を改善したと書かれているので期待して使ってみよう。

問題は、計算時間で、EncoderにEfficientNetB3を使っても、1エポックに8分以上かかる。他のDecoderでは2分以内である。CPUでの計算時間は1エポックが20分くらいで、他のモデルと変わらないので、他のモデルよりもCPUへの負荷が大きいということのようだが、よくわからない。

DeepLabV3+の論文では、EncoderにXceptionが使われているので、EfficientNetからXceptionに変えてみた。しかし、意外なことに、Xceptionに対応していないというメッセージが現れ、停止した。あらためて論文を見ると、Xceptionはモディファイして用いていると書いてある。

論文には、Xceptionよりスコアは低いが、ResNet-101を用いて計算した例も示されているので、EncoderをResNet-101に変更して計算してみた。CPUでは1エポックに30分くらいかかり、さすがにこれでは使えないのでGPUに切り替えてみた。そうすると、これも意外なことに、1エポックあたりの計算時間が、約60秒であった。これなら問題なく条件検討ができる。

計算結果：LB=0.811：BCE+0.5*Focal

256x256を同じ視野で解像度を384x384にすると、LB=0.816になった。

Atrous convolution：設定パラメータは"decoder_atrous_rates"となっていて、デフォルトは (12, 24, 36)である。開発者の論文では (6, 12, 18) となっている。

encoder_output_stride：デフォルトの16以外の設定32や8を試しているが、詳細不明

decoder_channels：256をデフォルトにして、128ではスコアが下がるという記述はあるが、系統的に調べたというふうでもない。

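I read the paper between runs but could not find a convincing explanation. The model stood out in the review above, so I had intended to make it my main model and work on it carefully. However, the original authors' description of the model is vague and I do not see much potential in it, the LB score with the default settings is poor, and above all EfficientNet cannot practically be used as the Encoder (the computation time on GPU is too long). I see no advantage, so I think I will stop working on DeepLabV3+ here.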

"COCO"が気になって調べてみた。

Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár

arXiv:1405.0312v3 [cs.CV] 21 Feb 2015
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

f:id:AI_ML_DL:20210306094212p:plain

COCOデータセットを使ったコンペが継続的に開催されている。

Leaderboard: Detection | Keypoints | Stuff | Panoptic | Captions

Panoptic segmentation aims to unify the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance). Existing metrics are specialized for either semantic or instance segmentation and cannot be used to evaluate the joint task involving both stuff and thing classes. Rather than using a heuristic combination of disjoint metrics for the two tasks, the panoptic task introduces a new Panoptic Quality (PQ) metric. PQ evaluates performance for all categories, including both stuff and thing categories, in a unified manner.

Panoptic Leaderboardのトップのチームのpaper

Joint COCO and Mapillary Workshop at ICCV 2019: Panoptic Segmentation Challenge Track Technical Report: Explore Context Relation for Panoptic Segmentation
Shen Wang, Tao Liu, Huanyu Liu, Yuchen Ma, Zeming Li, Zhicheng Wang, Xinyu Zhou, Gang Yu, Erjin Zhou, Xiangyu Zhang, Jian Sun

f:id:AI_ML_DL:20210306112638p:plain

これくらいの論文を書いてみたいものだ。

Now, back to the LB=0.836 model, EfficientNetB5-Unet, to tune a single model. I will aim for LB=0.84+ by adjusting the loss function, the image resolution, and so on.

3月7日（日）

HuBMAP：1,189 teams, two months to go（データ更新待ち）

LB=0.836を超えるために：

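Model: setting "decoder_attention_type" to "scse" turns Unet into a Unet with a built-in attention module. I have not yet compared the effect of scse systematically, so I want to pin it down.

Loss function: with Dice as the base, examine the effect of combining it with BCE and Jaccard.

Slicing (1024x1024) and downsizing (256x256): consider raising the downsized resolution up to a maximum of 512x512. This trades off against changes in model size and batch size, so I will investigate within the feasible range.

Model: neither the type nor the size has been optimized at all. I believe I have tried many things, but unless the parameters are fitted to some extent for each encoder-decoder combination, there is a high chance of drawing the wrong conclusion.

Batch size: it trades off against model size, so the range of choices is narrow.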

Run1：scseの効果：LB=0.832：0.004下がった。：train中のloss（train_loss及びval_loss）は、scseを使う前よりも下がっているのでスコアは上がると期待していたのだが、どういうことだろうか。

Run2：scseは使わないで、解像度を256x256から320x320に変更：LB=0.834：0.002下がった。こちらも、train中のloss（train_loss及びval_loss）は、解像度を上げる前よりも下がっているのだが、どうなっているのだろうか。

Run1もRun2もoverfittingしている可能性が高い。Encoderを小さくするか、エポック数を少なくする必要があると思う。

次は、損失関数だが、上記のReviewには、「The loss functions which use cross-entropy as the base and the overlap measure functions as a weighted regularizer show more stability during training.」と書かれていた。今使っているものでいえば、overlap系はDice, Jaccard, Lovaszとなる。試してみるしかないのだろう。MAnetの論文では、BCEとDiceの比率は1対4であったように思う。Diceをベースに、BCEを加えている。LB=0.836を得ているのは全て、DiceLossのみなので、手始めに、MAnetの論文をまねてみるのもよいかもしれない。

と言いつつ、さきほど、Run2のDiceLossを、5つの損失関数を全て足したものに置き換えて、trainingした。その結果、val_lossがtrain_lossの1.75倍にもなってしまった。overfitting間違いなしだ。Diceが最も収束が遅いことはわかっていたのだから、エポック数は、少なくとも、30から20くらいまでは下げておくべきだったかもしれない。train_lossは、個々の損失関数のtrain_lossの和に近い値となっているので、正しく動作していることは確認できた。

Run3：Run2のDiceを、Dice + BCE + Lovasz + Jaccard + Focalとした：LB=0.828：train_lossとval_lossとの乖離が大きかった（1対1.75）ので、overfittingになり、もっと低いスコアになると予想していたので、このスコアには多少驚いた。

明日は、Run1, Run2, Run3のエポック数をいずれも30から25に減らしてみよう。さらに、20に減らしてみよう。Run1もRun2もスコアは上がるはずだと思っている。Run3は、エポック数を減らすとともに、学習速度が他よりも速いBCEとFocalの割合（寄与率）を下げてみよう（たとえば：Dice + Lovasz + Jaccard + BCE/2 + Focal/2）。

3月8日（月）

HuBMAP：1,200 teams, 18 days to go (two months to go)

Run4：Unet ---> Unet_scseの効果を調べる：エポック数を30から25に減らしてみる。：LB=0.828：さらに0.004下がった。（0.836 -> 0.832 -> 0.828）

25エポックでの最終エポックのlossは、30エポックでの最終エポックのlossよりも大きいだけでなく、scseを使う前よりも大きな値になった。30エポックで、scseを使った場合、lossが下がったので、LBスコアは上がると予測したが、スコアは下がった。30エポックでは、overfittingになったと思ったのだが、そうではなかったのか。

振り返ると、モデルだけ変更した場合（例えばB5からB6に変更）、train_lossもval_lossも下がって、LBスコアが上がると期待するのだが、スコアが下がることもときどきあった。そういうときは、この課題にはB5が適切なのだろうと判断して先に進んできた。ここに問題/課題がありそうだが、・・・。

Run5：25エポックに下げてだめだったのだから、35エポックについても調べる必要がある。増やすという選択肢の他に、30エポックに戻して、収束の速いBCEやFocalと組み合わせる、という方法も考えられる。配合比をどうするか。まずは、30エポックに戻して、（Dice + BCE）を試してみよう。：LB=0.822：また下がった。

Run6：すなおにDiceのみ, 35エポックを試す。：LB=0.836：scse無しのスコアと同じだが、ほんの少し、前進できたような気がする。

Run7：では、40エポックにしてみよう。：LB=0.833：だめか。scseのメリットを見出せず。

次は、画像解像度の効果について調べる。320x320で検討するつもりだったが、現状のモデルとバッチ数ではこれ以上解像度を上げることができないので、512x512まで使える条件（モデルを小さく、バッチ数を少なく）で検討してみようと思う。

Bristol-Myers Squibb – Molecular Translation：138 teams, 3 months to go
Can you translate chemical images to text?

方針転換：参加しないと何も学べない --> 参加すれば、新たなデータと課題に触れる機会が得られ、新たな前処理手法、予測手法を学ぶことができる可能性がある。 --> 参加させていただくことにしよう。

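The following is from Wikipedia.

Ethanol: CH3CH2OH: InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 (standard InChI)

Every InChI starts with the string "InChI=" followed by the version number, which is currently 1. For a standard InChI this is followed by the letter S. The remaining information is structured as a sequence of layers and sublayers, each layer holding one kind of information. Layers and sublayers are separated by the delimiter "/" and, except for the chemical formula sublayer of the main layer, begin with a characteristic prefix letter. The six layers and their important sublayers are as follows.

1. Main layer

Chemical formula (no prefix) - the only sublayer that appears in every InChI
Atom connections (prefix: "c") - the atoms in the chemical formula other than hydrogen are numbered, and this sublayer describes which atoms are bonded to which other atoms.
Hydrogen atoms (prefix: "h") - describes how many hydrogen atoms are bonded to each atom.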

f:id:AI_ML_DL:20210308223005p:plain

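L-ascorbic acid

InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1 (standard InChI)

3. Stereochemical layer

Double bonds and cumulenes (prefix: "b")
Tetrahedral stereochemistry of atoms and allenes (prefixes: "t", "m")
Type of stereochemistry information (prefix: "s")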
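The layer structure described above can be seen by splitting the two example strings on "/". The sketch below is a simplified illustration, not a full InChI parser: `split_inchi` is a hypothetical helper that assumes the second field is the chemical formula and that each later field starts with a one-letter prefix that occurs only once, which holds for these two standard InChIs.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula, and prefixed sublayers."""
    head = "InChI="
    if not inchi.startswith(head):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len(head):].split("/")
    layers = {"version": version, "formula": formula}
    for part in rest:
        layers[part[0]] = part[1:]   # prefix letter -> sublayer content
    return layers


ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
ascorbic = split_inchi(
    "InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1"
)
print(ethanol)
print(ascorbic)
```

Ethanol only has the main-layer sublayers (formula, "c", "h"), while L-ascorbic acid also carries the stereochemical sublayers "t", "m", and "s". A model that predicts InChI has to get every one of these fields right, in order.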

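This task of inferring the InChI string from a structural formula strikes me as very demanding.

Moreover, the structural formula images provided in the competition are not as clear as the ascorbic acid formula above.

The task is to train a model on structural formula images and their InChI strings (labels), and then to predict the InChI string from a structural formula.

The ordering of the numbers follows the IUPAC rules.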

3月9日（火）

HuBMAP：1,205 teams, two months to go（データ更新作業開始）

Run8：EfficientNetB5-Unet, batch=8, 512x512

現在午前11時4分、submitできない。Discussionをみると、新しいデータが届き、更新作業に入ったとのこと。

We've got a new private test set incoming. Some of the old test samples will be moved to train. The process is currently underway and may take some time. Submissions are disabled while this is taking place. We will notify you when the update is complete. Thank you for your patience!

BMS：149 teams, 3 months to go

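This is image-to-text conversion, so image-captioning models should be applicable. The image is expressed in characters: from the structural formula, the kinds and numbers of elements, the shape of the molecule, the bonding positions of the functional groups, and the bonding positions of the hydrogens are laid out according to the InChI format.

Although it is called captioning, the output is not words but a single line of element symbols, digits and hyphens that give bonding positions, parentheses, slashes that separate the layers, and so on. Ascorbic acid contains only C, O, and H, but elements such as N, P, S, Cl, F, and Br also occur; bonds can be single, double, or triple; and 4-, 5-, 6-, and 7-membered rings and chains are also distinguished.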

3月10日（水）

HuBMAP：1,205 teams, 2 months to go（データのアップデート完了）

さて、再始動だ！と思ってプログラムを動かしたら、早速、エラーが発生した。そう、train/に、マスク無しのデータが存在する（旧test data）ために生じたエラーだ。

借り物のコードを、部分的にしか理解できていない状態で使っているので、エラー解消は容易ではないが、コードを理解する良い機会だ。

＊＊＊別件作業中＊＊＊

3月11日（木）

＊＊＊別件作業中＊＊＊

HuBMAP：1,210 teams, 2 months to go : May 10, 2021 - Final submission deadline.

データセットが更新されることによって、マスクの不具合が修正され、train_dataが増えたことから、LBスコアが跳ね上がったようだ。今の1位はLB=0.921。休眠状態だったつわものたちが本格的に動き始めるとさらに上がっていくのだろう。

さて、データ修正前の値だが、自分の0.836では勝負にならない。望みがあるとすれば、現在LB=0.90+の方が使用しているモデルがEfficientNetB4-Unetで解像度が512x512だということで、特殊なモデルを使っているわけではなく、自分も同様の条件で試したことがあるというところだ。しかし、それは、参加者みんながわかっていることなので、そこまでは到達して当たり前みたいな感じだとすると、やはり、現状では勝ち目はない。

別件を済ませて、これに集中したいのだが、・・・。

3月12日（金）

HuBMAP：1,211 teams, 2 months to go (gold: 0.916+)

train_dataに更新前のtest_dataが追加されたという情報から、勝手に、マスク無しのtrain_dataとして追加されたと思い込んでいたが、マスクデータは全てのtrain_dataにセットされている。したがって、エラーの原因はマスクではない。

image = dataset.read([1,2,3], <---ここでエラーが生じているようだが、・・・。

IndexError: band index 2 out of range (not in (1,))

よくわからんが、次の3種類のフォーマットが混在しているために、エラーになっているのだろうと思う。

[height, width, channel]

[channel, height, width]

[1, 1, channel, height, width]

ここからは、さらに踏み込んで、コードを、書き換えなければだめだろう。

とりあえず、エラーが生じる直前までの8個のtrain_dataだけを用いることができるようにして、計算させてみた。

プログラムが最後まで走るのか、さらに、commit, submitを経て、LBスコアが得られるのかは全く分からないが、train_lossは、データ更新前には見られなかったくらい小さくなっている。そのぶん、極端なoverfittingになっているようにみえる。

trainからinferenceに移ったら、trainのときと同様のエラーが発生した。データの読み込みを、データフォーマットに合わせるように、プログラムしなおす必要がありそうだ。

trainの場合は、フォーマットの違うtrain_dataを読み込まないようにしたのだが、inferenceの場合は、フォーマットの違うtest_dataを避けたらスコアが正しく計算されなくなってしまう。

データ更新前に投稿したものは、更新されたデータを使って再計算されることになっているようだが、自分のプログラムは、更新されたデータには対応していないので、再計算されないように思う。

train_modelを読み込んで、inferenceだけKaggle kernelを使っているチームの場合、Kaggleスタッフでは、trainのやりなおしができないので、あまり、意味がないような気がする。

更新前のデータに対して動いていたのだから、更新されたデータに対しても動いてくれなければ困る、と言いたいところだが、だれも言わないだろうな。それくらいはできないとだめですよ、と言われそうだ。優しいスタッフが、更新されたデータに対応できないチームのために、変換コードを用意してくれるかもしれない。

いやいや、スタッフが乗り出す前に、Kaggler仲間が、解決策を提案してくれている。

参加者どうしで助けあっている！

これが、Kaggleだ！

3月13日（土）

HuBMAP：1,213 teams, 2 months to go（現在のLBスコア1位：0.926）

train_dataとしては、15個のうちの8個を使っているだけだが、データ更新前と比べて、train_lossが70%くらいになっているようである。val_lossも相応に下がればうれしいのだが、逆に大きくなっているので、明らかに、overfittingの状態になっている。

マスクの精度が向上していると思うので、train_lossが下がるのは想定どおりであるが、教師データに学んだ結果が、validation_dataに正しく反映されないというのはどういうことなのか。ふつうに、overfittingということで片付けておく。

transformの程度を強くしてみたら、val_lossが明らかに小さくなった。これは良い兆候である。

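In the preprocessing step that cuts the images into tiles there is a value called MIN_OVERLAP, and a comment said that it changed the score considerably, so I will try it.

Increasing the overlap between tiles presumably amounts to increasing the number of training samples.

I changed it from the initial value of 32 to 64, 128, and 192. With MIN_OVERLAP=192, RAM usage increased by about 2.5 GB.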
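Why a larger overlap means more tiles (and more RAM) can be seen from a small tiling sketch. This is my own simplified version, not the code of the notebook I am using: `tile_starts` and the image side of 10000 pixels are assumptions, and only the tile size 1024 and the overlap values come from the notes above.

```python
import math


def tile_starts(length, tile, min_overlap):
    """Start offsets of tiles along one axis, overlapping by at least min_overlap."""
    if length <= tile:
        return [0]
    stride = tile - min_overlap                      # largest allowed step
    n = math.ceil((length - tile) / stride) + 1      # tiles needed to cover the axis
    # Spread the starts evenly; the last tile ends exactly at the image border.
    return [i * (length - tile) // (n - 1) for i in range(n)]


side, tile = 10000, 1024   # hypothetical square image, 1024x1024 tiles
for overlap in (32, 64, 128, 192):
    starts = tile_starts(side, tile, overlap)
    print(f"MIN_OVERLAP={overlap}: {len(starts)} x {len(starts)} = {len(starts) ** 2} tiles")
```

On this hypothetical image the tile count per axis grows as the overlap grows, so the number of tiles held in memory grows roughly with its square. That fits the increase in RAM usage.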

結果は、若干、overfitting側に変化しているようにみえる。

いずれにしても、現状では、train_dataの全体を読み込むことができず、さらに、もっと困ったことには、肝心のtest_dataが読み込めない。これでは、先に進めない。

明日は、Discussionと公開コードを参考にして、フォーマットが変わってしまったtrain_dataとtest_dataの読み込みができるようにする予定。

Overfitting, underfitting, raising the score: see A. Geron's text (chapter 7: Ensemble Learning and Random Forests). Experiment first!

・Number of epochs (early stopping, the epoch count of OneCycleLR)

・Model size: B0, B1, B2, B3, B4, B5, B6, B7, B8, 18, 34, 50, 101, 152, 121, 169, 201: more parameters resolve finer detail, but the model overfits more easily and may generalize worse.

・Amount of data (increased by augmentation)

・Type and strength of augmentation (ElasticTransform, GridDistortion, OpticalDistortion)

・dropout (units, layers, regions inside the image)

・ensemble (K-fold, stacking, voting, random sampling)

・Model initialization (random, Glorot (Xavier), He, LeCun)

・pretrained_model (type, amount, and quality of the data)

・out of bag evaluation

・random patches, random subspaces

from today's Twitter:

Deep Learning Weekly@dl_weekly
From this week's issue: Algorithms are meaningless without good data. The public can exploit that to demand change.

3月14日（日）

HuBMAP：1,219 teams, 2 months to go（Goldは0.92+）

今日は、Discussionと公開コードを参考にして、フォーマットの違うものが含まれているtrain_dataとtest_dataを読み込むことができるようにする。それができれば、データ更新後の初LBスコアを取得する。

公開コードを参考にして、作業を始めた。自分がベースにしているコードの作者をフォローしようと思って、そのコードを見に行ったら、2日前に更新されており、新たに加わったフォーマットに対応できるように変更されていた。有難い。しかも、そのコードは非常にスマートで、たった1行追加されているだけである。

ということで、更新データに、対応できるようになった。ただし、新たな課題が生じた。train_dataが増えたのはありがたいが、RAMに読み込んで動かしているため、13GBの制限に引っかかって、半分強くらいしか使えない。

さらに、train_lossとval_lossの差が大きい。更新データに対応したコードの製作者も、同様な結果になっていて、更新前のデータに対して使っていたパラメータのままでは、LBスコアは、かえって、悪くなる。

EfficientNetB5-Unetを試してみたが、今日は、LB=0.512が最大であった。いくらなんでもこれは無いだろう、と思った。リーダーボードのスコアがどんどん上がっているので、更新前のデータに対して用いていたのと同じ条件で計算しても、自分のスコアも、当然、上がるものだと思い込んでいた。

今日は、submitできるようになって、良かった。

しかし、明日から、overfittingとの戦いだ！

3月15日（月）

HuBMAP：1,223 teams

今頃は、LB=0.85+だと思ったのだが、更新データでの最良値は、なんと、LB=0.531である。原因は、train_dataの半分弱のデータを使っていないことにあるのかもしれない。

更新前と同じ8件のtrain_dataを使っているのだが、更新前と同じデータは5件あるが、3件は異なっている。

15件全部のtrain_dataを使いたいところだが、今の使い方ではメモリーオーバーになるので、まずは、更新前と同じデータ名のtrain_data：8件を使ってみよう。

I rewrote the code so that only the train_data from before the update is read. I added code like the following to skip the new train_data files.

if filename == '26dc41664':
    continue

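As a result, even with the pre-update model unchanged, the extreme overfitting tendency during training disappeared. I did not expect such a different result.

Furthermore, both train_loss and val_loss became considerably smaller than before the update, so I hoped for a better score than before, but I got LB=0.831, which unfortunately fell short of the pre-update LB=0.836.

With this I feel I have finally reached the starting line.

Just as using train_data similar to the test_data improves the score, using train_data that does not resemble the test_data fails to raise it; I think that is what was happening. Studying material outside the exam scope does not earn good marks.

Experiences like this make me feel that training aimlessly does not produce a good model. How should a model be trained so that it handles the test_data properly? If the test data is known, examining its characteristics and its similarity to the training data seems important. When a Kaggle competition starts, some Kagglers analyze the data and, if the training and test data differ in character, kindly raise the issue in Discussion and explain the reasons and countermeasures in detail.

In multi-class classification, the number of training samples per class is usually balanced, but when separating foreground from background, the background area can be many times larger. How does that affect segmentation accuracy? What factors affect accuracy, and how should they be handled to get good results?

Tomorrow:

1. Code study: image preprocessing

2. Beat LB=0.831!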

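March 16 (Tue)

HuBMAP: 1,232 teams (May 10, 2021 - Final submission deadline.)

1. Code study (planned for about 3 hours): the model looks like a Resnet-Unet, but an FPN seems to be used in the decoder. Looking at the channel settings for the FPN and Unet, I realized that although I also use Unet and FPN, I have always left them at their defaults and never set them myself. The Unet default is decoder_channels=(256, 128, 64, 32, 16), but the public code appears to use more channels. For FPN too, the defaults are decoder_pyramid_channels=256 and decoder_segmentation_channels=128, but the public code appears to use 512 channels. I will vary these parameters and see how performance changes.

2. Add as many of the 7 newly added train_data files as memory allows to the 8 pre-update files, train, and check whether the LB score goes up.

I could add only 2, but I will try train and inference anyway.

train_data=10 files: LB=0.839 (EfficientNetB4-Unet-DiceLoss).

How can I use all 15 train_data files?

A good point of the program I am using is that tiles can be regenerated freely by changing parameters. Giving top priority to using all the train_data, I reduced the tile size and set the overlap to zero, which made it possible to use all 15 files.

With a smaller tile size, each tile contains fewer glomeruli, and I suspect many glomeruli get split across tiles. How does this affect glomerulus segmentation?

With the smaller tile size, the output file far exceeded the usual 5 MB, reaching 7.5 MB or 21 MB; the former gave an error and the latter LB=0.675. A warning about tile size also appeared during commit, so I will prioritize going back to 1024.

Image tiling seems to be a very important step. I will look into ways to cut out images so that every glomerulus stays whole, without its outline being split.

Tomorrow I will look for ways to beat this result (train_data=10 files: LB=0.839, EfficientNetB4-Unet-DiceLoss). As one approach, I will try changing Unet's decoder_channels.

(Even so, the gap to the top is obvious! On the final day the top may be at 0.95+ while I am happy to have reached 0.85+. I need to think about how to close this gap and overtake in the remaining 50 days.)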

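March 17 (Wed)

HuBMAP: 1,236 teams (May 10 final submission deadline)

1. Code study

@iafoss's public code builds class UneXt50(nn.Module). FPN, UnetBlock, and ASPP are used as parts. The encoder is ResNet50 and the decoder is a Unet with ASPP and FPN built in.

It shows that combining advanced components yields an advanced model.

2. Effect of Unet's decoder_channels

I doubled the default decoder_channels=(256, 128, 64, 32, 16) to decoder_channels=(512, 256, 128, 64, 32). The output data size grew from 4.9 MB to 5.34 MB, and noise seems to have increased. The result was LB=0.815. More channels can hold more information, so it seems there should be no downside, but overfitting is the pitfall.

Next I halved the channels to decoder_channels=(128, 64, 32, 16, 8). The output data size grew to 10.66 MB, and noise seems to have increased further. The result was LB=0.231. I think of the channel count as something like memory for holding features at each depth, so presumably there was too little memory to hold the information. From the train_loss and val_loss values I did not expect the score to be this bad. I am very surprised.

Let me play a little more. What happens if the channel count increases rather than decreases along the decoder? With decoder_channels=(16, 32, 64, 128, 256), the output data size was 6 MB and the data looked noisy, but LB=0.798 was not as bad as I expected.

Then what if they are all set to roughly the average? I tried decoder_channels=(128, 128, 128, 128, 128): LB=0.821.

In the original paper the channel counts are far larger. The minimum is 64, so with five stages it would be decoder_channels=(1024, 512, 256, 128, 64).

This was no time to be fooling around.

I ran this set, decoder_channels=(1024, 512, 256, 128, 64), and it did not look bad. I will submit it tomorrow. (Result after submitting: LB=0.813.)

In fact, I had prepared the 1024 set to follow the (512, 256, 128, 64, 32) set, but the score of the 512 set was disappointing, so I drifted into playing around.

The results are summarized below: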

decoder_channels=(128, 64, 32, 16, 8)：LB=0.231

decoder_channels=(256, 128, 64, 32, 16)：LB=0.831

decoder_channels=(512, 256, 128, 64, 32)：LB=0.815

decoder_channels=(1024, 512, 256, 128, 64)：LB=0.813

decoder_channels=(16, 32, 64, 128, 256)：LB=0.798

decoder_channels=(128, 128, 128, 128, 128)：LB=0.821

decoder_channels=(256, 256, 256, 256, 256)：LB=0.790

If anything can be said from these results, it is: "try various things and you may find a good combination."

U-Net: Convolutional Networks for Biomedical Image Segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox

[Figure from the U-Net paper; image not reproduced]

While searching for this original Unet paper, I came across a new version of Unet.

UNET 3+: A FULL-SCALE CONNECTED UNET FOR MEDICAL IMAGE SEGMENTATION
Huimin Huang, Lanfen Lin, Ruofeng Tong, Hongjie Hu, Qiaowei Zhang, Yutaro Iwamoto, Xianhua Han, Yen-Wei Chen, Jian Wu

[Figure from the UNet 3+ paper; image not reproduced]

The code for this Unet 3+ is on GitHub: https://github.com/ZJUGiveLab/UNet-Version

Researchers and engineers who develop new code often publish it on GitHub and give the URL at the bottom of the paper's abstract.

March 18 (Thu)

HuBMAP: 1,245 teams

Today I will try Unet 3+. UNet-Version/models/ in https://github.com/ZJUGiveLab/UNet-Version contains the following code.

UNet.py
UNet_2Plus.py
UNet_3Plus.py
init_weights.py
layers.py

In addition, UNet-Version/loss/ contains the following code.

bceLoss.py
iouLoss.py
msssimLoss.py

The paper proposes a hybrid segmentation loss that combines focal loss, MS-SSIM loss, and IoU loss.

Besides UNet_3Plus, UNet_3Plus.py contains Unet_3Plus_DeepSup and Unet_3Plus_DeepSup_CGM. These are, respectively, the base code, the version with deep supervision, and the version with deep supervision and the class-guided module, which the paper describes as additional features.

The tools are in place.

Now, how do I use this Unet_3Plus_DeepSup_CGM?

Let me start with the easy part.

1. Checking the license:

The README at https://github.com/ZJUGiveLab/UNet-Version contains only the following, so I think citing this URL and the paper should be sufficient.

README.md
UNet 3+
Code for ICASSP 2020 paper ‘UNet 3+: A full-scale connected unet for medical image segmentation’

Requirements
python 3.6.2
pytorch 1.3.1

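2. Preparing to use UNet.py (ultimately I want to try Unet_3Plus_DeepSup_CGM, but I will start with a small program that makes it easier to work out how to use it)

UNet.py begins with the following code.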

import torch
import torch.nn as nn
import torch.nn.functional as F
from layers import unetConv2, unetUp, unetUp_origin
from init_weights import init_weights
from torchvision import models
import numpy as np

When I pasted UNet.py into a notebook and ran it, it stopped with the following message.

ModuleNotFoundError: No module named 'layers'

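from layers import unetConv2, unetUp, unetUp_origin: this is where it fails.

It seems the two modules 'layers.py' and 'init_weights.py' must be placed in the current directory.

The Python documentation, section 6. Modules, gives the following explanation.

A module is a file containing Python definitions and statements. The file name is the module name with the suffix .py appended. Within a module, the module's name (as a string) is available as the value of the global variable __name__. For instance, use your favorite text editor to create a file called fibo.py in the current directory with the following contents:

Now enter the Python interpreter and import this module with the following command:

In other words, if "layers.py" and "init_weights.py", which contain the functions UNet.py refers to, are placed in the Kaggle kernel's current directory "/kaggle/working/", they can apparently be loaded with import.

To do that, put "layers.py" and "init_weights.py" into a Kaggle dataset and attach it to the notebook as input data. Then copy them to the current directory with the following code.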

!cp ../input/unet-3plus-pytorch/models/init_weights.py /kaggle/working/
!cp ../input/unet-3plus-pytorch/models/layers.py /kaggle/working/

It looks roughly like this, and with it,

from layers import unetConv2, unetUp, unetUp_origin
from init_weights import init_weights

these two imports appear to have run without errors.
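As an alternative to copying the files, the directory that holds them can be put on Python's module search path. The sketch below shows the idea with a temporary directory standing in for the dataset folder. The helper add_module_dir and the module demo_layers are made up for illustration; on Kaggle the path would presumably be the dataset folder used in the cp commands above.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def add_module_dir(path):
    """Put a directory on sys.path so that .py files in it can be imported."""
    path = str(Path(path).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# Demo with a throwaway directory standing in for the Kaggle dataset folder.
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "demo_layers.py").write_text("def hello():\n    return 'imported'\n")

add_module_dir(demo_dir)
importlib.invalidate_caches()
demo_layers = importlib.import_module("demo_layers")
print(demo_layers.hello())
```

Either approach works because import searches the directories listed in sys.path; copying to /kaggle/working/ simply relies on the working directory already being on that list.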

3. Using UNet

With this, I think everything is ready to use the UNet in https://github.com/ZJUGiveLab/UNet-Version.

model = UNet( )

model.to(DEVICE)

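I used nn.BCEWithLogitsLoss( ) as loss_fn, but it does not seem to be working properly. The loss does not go down; that is, training does not progress.

Anyway, let's move on.

UNet_3Plus.py has, in addition to the base UNet_3Plus, Unet_3Plus_DeepSup and Unet_3Plus_DeepSup_CGM with options added.

I ran UNet_3Plus. batch_size=16 ran out of memory; batch_size=8 ran with 14.7 GB of GPU RAM. One epoch takes more than 7 minutes, and the loss goes down only a little at a time.

Unet_3Plus_DeepSup and Unet_3Plus_DeepSup_CGM use 11.4 GB and 12.5 GB of GPU RAM respectively, less than UNet_3Plus, but both stopped with the following error message.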

AttributeError: 'tuple' object has no attribute 'size'

loss = loss_fn(output, target): the error seems to occur during this computation.
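The message suggests that the deep-supervision variants return a tuple of outputs rather than a single tensor, so loss_fn receives a tuple. One common way around this is to apply the loss to each output and average. The wrapper below is a minimal sketch of that idea, not the repository's own training code: multi_output_loss is a hypothetical helper, and plain numbers stand in for tensors so that it runs without PyTorch. Whether every element of the tuple is a segmentation map would need to be checked in UNet_3Plus.py.

```python
def multi_output_loss(loss_fn, output, target):
    """Apply loss_fn to a single output, or average it over a tuple/list of outputs."""
    if isinstance(output, (tuple, list)):
        losses = [loss_fn(o, target) for o in output]
        return sum(losses) / len(losses)
    return loss_fn(output, target)


# Stand-in loss on plain numbers (with PyTorch this would be e.g. nn.BCEWithLogitsLoss()).
def squared_error(pred, target):
    return (pred - target) ** 2


print(multi_output_loss(squared_error, 0.5, 1.0))              # single output
print(multi_output_loss(squared_error, (0.5, 1.0, 1.5), 1.0))  # tuple of outputs
```

At inference time only one of the outputs (typically the full-resolution one) would be used for the mask.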

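Anyway, the plain UNet_3Plus without options now runs, so I will keep investigating. I suspect the small change in loss is due to the 'sigmoid' activation at the output. With other models too, using 'sigmoid' makes the loss harder to reduce and lowers the LB score by about 0.01, so I think something is wrong here.

Tomorrow, I will learn from the LB=0.916 public code.

I have a feeling there is some fundamental mistake in my training method (or training code).

March 19 (Fri)

HuBMAP: 1,251 teams, to May 10

Learning from the LB=0.916 public code:

EfficientNetB2-Unet or FPN: parameters are defaults except activation=sigmoid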

optimizer=Adam,

loss=bce_dice or bce_jaccard,

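These conditions are mostly within the range I have already explored. Under them, I got no further than LB=0.83+.

ReduceLROnPlateau, EarlyStopping,

Used properly, these might give about 0.01~0.02 more than the OneCycleLR I am using. (Corrected March 25: OneCycleLR is excellent, and used properly it may do better.)

KFold, GroupKFold

Used properly, this too might add about 0.01~0.02.

TTA

Might this also add about 0.01~0.02? When I tried TTA with +4, the score went up by about +0.005.

augmentation(albumentations)

The major ones are the same, so I expect any difference to be 0.01 or less (added/corrected March 25: in the HuBMAP competition, overly strong geometric deformation lowers performance. Color adjustment, blur, and contrast, used appropriately, can give an improvement of 0.02+), but there may be good ones among the less commonly used. I would like to try hiding parts of the image with multiple blocks (added March 25: albumentations has "cutout"), but I have not gotten to it yet.

Adding up these four effects gives a possible gain of about 0.04~0.07; added to 0.83, that would be about 0.87~0.90, but...

I had intended to examine the actual code in more detail, but that is homework for tomorrow onward.

For the week starting tomorrow, I will stick with EfficientNetB2-Unet and tune it.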

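March 20 (Sat) to March 26 (Fri): GPU 41 h

HuBMAP: I will fix the model to EfficientNetB2-Unet and tune it.

For Unet, set activation="sigmoid" and use defaults otherwise.

The LB=0.916 public code above lists loss=bce_dice or bce_jaccard, but a closer look at the code shows bce_weight=1, so it appears to be BCE only. This matches my recent experience: with activation set to "sigmoid", DiceLoss and JaccardLoss do not work at all in the code I am using.

If combining with BCE, FocalLoss or LovaszLoss would be the candidates. First I will proceed with BCE alone: SoftBCEWithLogitsLoss( )

For the optimizer I use AdamW together with OneCycleLR; compared with the Adam used in the LB=0.916 code, the only difference is whether weight_decay is at its default, so I do not think they differ much.

As for OneCycleLR, I am thinking of trying other lr_schedulers for comparison. PyTorch offers so many that it is hard to choose.

Ones that are easy to use and seem useful: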

torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1, verbose=False)

>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.05 if epoch < 30
>>> # lr = 0.005 if 30 <= epoch < 60
>>> # lr = 0.0005 if 60 <= epoch < 90
>>> # ...
>>> scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()

With this, lr is multiplied by gamma every step_size epochs.
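The rule can be written as a one-line formula. The function below is a plain-Python sketch of that rule (step_lr is an illustrative name, not the PyTorch class), reproducing the values in the comment block of the documentation example above.

```python
def step_lr(initial_lr, epoch, step_size, gamma=0.1):
    """Learning rate under a StepLR-style schedule: decay by gamma every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)


for epoch in (0, 29, 30, 59, 60, 89):
    print(epoch, step_lr(0.05, epoch, step_size=30))
```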

ReduceLROnPlateau

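It monitors val_loss and, when the improvement falls below a set threshold, multiplies lr by a constant (the PyTorch argument is named factor, playing the role of gamma in StepLR); it is like an automated StepLR.

The lr can also be changed while monitoring val_acc, val_dice, and so on.

EarlyStopping,

When I rewrote the original program, I made stopgap changes, so it may be running without errors while behaving incorrectly. Was there any loss-of-precision problem? (I was reminded of the precision workarounds for TPU.)

March 20 (Sat)

Learning from the train_code of the LB=0.911 public code. (max: LB=0.918)

EfficientNetB4-Unet (is activation the default None?)

loss_fn=FocalLoss(alpha=0.25, gamma=2.0, reduction='mean'); I think I saw this setting (alpha=0.25) in the MAnet paper.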

optimizer=ranger

Ranger - a synergistic optimizer combining RAdam (Rectified Adam) and LookAhead, and now GC (gradient centralization) in one optimizer.

EfficientNetB2-Unet：

With activation="sigmoid" on Unet, OneCycleLR(AdamW, max_lr=1e-3, epochs=20, ...), SoftBCEWithLogitsLoss, batch_size=16, and the 8 pre-update train_data files, the result was LB=0.808. At the final epoch train_loss=0.668 and val_loss=0.669, with almost no change from epoch 14 onward.

That (LB=0.808) hardly motivates me to continue, so I removed the activation, changed nothing else, ran it, and submitted: LB=0.850. I was astonished!!!
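One possible explanation for the jump, assuming SoftBCEWithLogitsLoss applies a sigmoid internally (as the "WithLogits" in its name indicates) while activation="sigmoid" adds another one at the model output: the loss then only ever sees values confined to (0, 1), which bounds how low it can go. The pure-Python sketch below computes that per-pixel floor. The function names are made up for illustration, and it shows the arithmetic rather than the library code.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit (numerically stable form)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# If the model already outputs sigmoid(z), the loss receives a value in (0, 1).
# Best case for a background pixel (target 0): z -> -inf, so the loss input -> 0.
floor_background = bce_with_logits(0.0, 0.0)   # ln 2
# Best case for a foreground pixel (target 1): z -> +inf, so the loss input -> 1.
floor_foreground = bce_with_logits(1.0, 1.0)   # ln(1 + e^-1)

print(round(floor_background, 4), round(floor_foreground, 4))
print(round(bce_with_logits(sigmoid(-20.0), 0.0), 4))  # very confident, still near ln 2
```

With mostly background pixels the floor sits a little below ln 2 ≈ 0.693, which would be consistent with the plateau at train_loss=0.668, val_loss=0.669 reported above.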

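At the final epoch train_loss=0.0219 and val_loss=0.0356, differing by a factor of about 1.6, which looks like nothing but overfitting; until now I would not have committed and submitted this. This time, however, I had meant to use it as a starting line, so I did. That it beat my personal best (LB=0.839) by 0.011 is a very big surprise. If I had been computing and displaying the Dice coefficient during training, I might have reached this level much sooner.

Here I added 2 train_data files (right at the limit of what the CPU session can hold), kept the other conditions, ran, committed, and submitted: the score rose to LB=0.865. I did nothing unusual, so I do not know why it went up. When I increased train_data from 8 to 10 files after the update, the LB score rose from 0.83+ to 0.839, so I expected some gain this time too, but not this much (0.015).

But the top, LB=0.928, is still 0.063 away.

Tomorrow I will vary OneCycleLR's max_lr around 1e-3. Today I tried max_lr=2.5e-4 and got LB=0.853. That may have been a bit too low. I will check around max_lr=5e-4, 2.5e-3, and 5e-3. I also plan to try FocalLoss and LovaszLoss.

March 21 (Sun)

I checked the LB score as a function of OneCycleLR's max_lr.

With default settings and epochs=20, lr starts at max_lr/25, reaches max_lr right after 6 epochs of computation, then decays, finally reaching max_lr/25/1e4.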
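The shape described here can be reproduced with a few lines of plain Python. The sketch below is modelled on the default cosine-annealed one-cycle schedule (pct_start=0.3, div_factor=25, final_div_factor=1e4) and is not the PyTorch source. The 149 steps per epoch is an assumption taken from the batch count that appears in later runs.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Cosine-annealed one-cycle learning rate at a given step (0-based)."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    up_end = pct_start * total_steps - 1      # step at which max_lr is reached
    last = total_steps - 1

    def cos_anneal(start, end, pct):
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= up_end:
        return cos_anneal(initial_lr, max_lr, step / up_end)
    return cos_anneal(max_lr, min_lr, (step - up_end) / (last - up_end))


total = 20 * 149   # 20 epochs x 149 steps per epoch (assumed)
for label, step in (("start", 0), ("after 6 epochs", 6 * 149 - 1), ("end", total - 1)):
    print(label, one_cycle_lr(step, total, max_lr=1e-3))
```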

The results were as follows. The numbers alone are hard to read, so I will plot them.

max_lr=2.5e-4：LB=0.853

max_lr=5.0e-4：LB=0.866

max_lr=1.0e-3：LB=0.865

max_lr=2.5e-3：LB=0.855

max_lr=5.0e-3：LB=0.863

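Plotted, they look like the figures below. The LB score seems to peak somewhere between max_lr=5e-4 and 1e-3, though I am also curious what happens above 5e-3.

[Figure: LB score vs. max_lr (two plots; images not reproduced)]

This behavior may also change with the number of epochs.

With a smaller max_lr, lr is small throughout, so convergence is presumably slower. If so, increasing the number of epochs at max_lr=2.5e-4 should raise the LB score.

With a large max_lr, training appears unstable around epoch 6, where lr exceeds typical values. After the decay begins it takes time for lr to return to a typical value, so fewer epochs remain between that point and the final epoch, and training may end up insufficient.

Even if the number of epochs is increased, the OneCycleLR lr curve keeps the same shape, so with a large max_lr more epochs means more epochs at an appropriate lr but also more epochs at an lr above the appropriate range. That said, I do not know how the extra epochs at excessive lr affect the final result.

Let's just say nothing is known until it is tried.

This may be too fine-grained (the goal is far off), but I will add the following two conditions.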

max_lr=7.5e-4：LB=0.857

max_lr=1.0e-2：LB=0.858

[Figure: LB score vs. max_lr with the two added conditions (two plots; images not reproduced)]

I had nothing but high hopes; all I can do is laugh.

Next, I will try a multi-cycle scheduler instead of a single cycle.

torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr, max_lr, step_size_up=2000, step_size_down=None, mode='triangular', gamma=1.0, scale_fn=None, scale_mode='cycle', cycle_momentum=True, base_momentum=0.8, max_momentum=0.9, last_epoch=-1, verbose=False)

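[Figure: image not reproduced]
I only ran it briefly, but when I tried Adam as the optimizer, it stopped with a message to the effect that an optimizer without momentum cannot be used. So I used SGD with nesterov=True. With base_lr=0.1, max_lr=3, epochs=20, and defaults otherwise, 20 epochs covered only part of one cycle. I then set step_size_up=500 (default 2000) to get 3 cycles in 20 epochs, and the loss showed abnormal values near lr=3. This is only a guess, but I think the rapid rise in lr was the problem. When I lowered max_lr from 3 to 2, it ran normally and the loss at the final epoch 20 was reasonable.

val_loss goes down as the cycles repeat, so I will look for the best combination of max_lr, step_size_up, and number of epochs.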
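Because CyclicLR counts batch steps rather than epochs, the number of cycles follows from the steps per epoch. The sketch below is a plain-Python version of the triangular shape plus that arithmetic. triangular_lr is an illustrative function, not the PyTorch class, and the 149 steps per epoch is assumed from the batch count noted in later runs.

```python
def triangular_lr(step, base_lr, max_lr, step_size_up, step_size_down=None):
    """Learning rate of a triangular cyclic schedule at a given batch step."""
    if step_size_down is None:
        step_size_down = step_size_up
    cycle_len = step_size_up + step_size_down
    pos = step % cycle_len                          # position inside the current cycle
    if pos <= step_size_up:
        frac = pos / step_size_up                   # rising edge
    else:
        frac = (cycle_len - pos) / step_size_down   # falling edge
    return base_lr + (max_lr - base_lr) * frac


steps_per_epoch = 149   # assumed batch count per epoch
total_steps = 20 * steps_per_epoch
for up in (2000, 500):
    cycles = total_steps / (2 * up)
    print(f"step_size_up={up}: {cycles:.2f} cycles in 20 epochs")
```

With the default step_size_up=2000 the 2,980 steps cover about three quarters of one cycle, while step_size_up=500 gives just under 3 cycles.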

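March 22 (Mon)

lr_scheduler.CyclicLR: with 20 epochs and max_lr=2, I adjusted step_size_up to get 1, 2, and 3 cycles, ran them, and submitted. The results are below. The values in parentheses are train_loss and val_loss at the final epoch.

3 cycles: LB=0.869 : (0.0267, 0.0314)

2 cycles: LB=0.839 : (0.0258, 0.0301)

1 cycle: LB=0.856 : (0.0243, 0.0296)

It was good to set a personal best, but from the loss values I had expected fewer cycles to give a higher LB score, so my guess was off.

I then eagerly ran and submitted the following, thinking it would surely work.

30 epochs, 3 cycles, max_lr=3: LB=0.835 : (0.0253, 0.0353): val_loss is too high!

From the start I had decided on base_lr=0.1, max_lr=3 with 20 epochs and 3 cycles, but with 20 epochs and 3 cycles, setting max_lr=3 made val_loss diverge near lr=3. So I looked for conditions under which max_lr=3 does not diverge and tried 30 epochs, expecting a good result. val_loss came out at 0.0353, which is rather large, but at epoch 29 the values were (0.0251, 0.0303), so I thought it might work and committed and submitted. A pity.

Because I had not computed the number of steps correctly, the end of epoch 30, which should have used the minimum lr, instead used the lr of the first 30 steps rising from base_lr, which may have slightly disturbed the learned weights. I should at least have set the step count correctly. (I had set 740 where it should have been 745.)

Recomputing the other conditions precisely, it turns out that, by chance, only the run that gave LB=0.869 ended its final epoch while lr was still decreasing. Having entered 500 where I meant to enter 494 may have worked in my favor. For the run that gave LB=0.835, max_lr=3 also seems to have worked against it.

While using OneCycleLR, max_lr=3 had stuck in my head, and I seem to have assumed that base_lr=0.1, max_lr=3 was also best for CyclicLR. Leslie N. Smith's papers list many tried ranges in their tables, such as 0.1-0.5, 0.1-1, 0.1-1.5, 0.1-2, 0.1-2.5, 0.1-3, 0.1-3.5, 0.1-4, and 0.09-0.9.

From two papers by Leslie N. Smith: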

Furthermore, there is a certain elegance to the rhythm of these cycles and it simplifies the decision of when to drop learning rates and when to stop the current training
run. Experiments show that replacing each step of a constant learning rate with at least 3 cycles trains the network weights most of the way and running for 4 or more cycles
will achieve even better performance. Also, it is best to stop training at the end of a cycle, which is when the learning rate is at the minimum value and the accuracy peaks.

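It says to use 3 or more cycles (added: needs examination) and to stop at the epoch where lr is at its minimum. A caution of my own: it is important to compute the number of steps correctly so that the final epoch does not include a region where lr is rising.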
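The step arithmetic behind this caution can be sketched as follows. Both helpers are hypothetical names written for this note, and the 149 steps per epoch is an assumption (it is the batch count that makes 745 the correct value for 30 epochs and 3 cycles).

```python
def cyclic_step_sizes(steps_per_epoch, epochs, cycles, up_fraction=0.5):
    """Split the total number of batch steps into `cycles` equal cycles.

    Returns (step_size_up, step_size_down, leftover_steps).
    """
    total_steps = steps_per_epoch * epochs
    cycle_len = total_steps // cycles
    step_size_up = round(cycle_len * up_fraction)
    step_size_down = cycle_len - step_size_up
    leftover = total_steps - cycle_len * cycles
    return step_size_up, step_size_down, leftover


def steps_into_next_cycle(steps_per_epoch, epochs, step_size_up, step_size_down):
    """How many steps past the last completed cycle boundary the run ends."""
    total_steps = steps_per_epoch * epochs
    return total_steps % (step_size_up + step_size_down)


print(cyclic_step_sizes(149, 30, 3))              # 30 epochs, 3 cycles
print(steps_into_next_cycle(149, 30, 740, 740))   # the mistaken setting
print(steps_into_next_cycle(149, 30, 745, 745))   # the intended setting
```

With 740 the run ends 30 steps into a fourth cycle, on the rising edge, which matches the 30 steps mentioned above; with 745 it ends exactly at a cycle boundary.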

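Added 3/24: the idea of 1 cycle came after cyclic. From my experiments so far, the best number of cycles and the best range and shape of the learning rate depend on the amount and quality of the data, and also on the model and image processing, so they have to be found by repeating experiments carefully.

The tables list epoch counts from about 12 to 100 for CyclicLR and from about 100 to 800 for OneCycleLR. Larger max_lr tends to go with more epochs. In my runs, setting max_lr to 3 sometimes produced abnormal values and broke the model. In that case, either the number of epochs must be increased or max_lr lowered.

Something that has been bothering me: I only compute train_loss and val_loss, so tomorrow I will make it possible to compute and display train_dice and val_dice.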

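March 23 (Tue)

GPU : 17h 38m available of 41h

HuBMAP:

20 epochs, 3 cycles, 0.1-2 (slope: (max_lr-base_lr)/step_size_up = (2-0.1)/500): LB=0.869 (yesterday's result)

20 epochs, 3 cycles, 0.1-1 (slope: (max_lr-base_lr)/step_size_up = (1-0.1)/500): LB=0.845 (yesterday's result)

20 epochs, 4 cycles, 0.1-2 (slope: (max_lr-base_lr)/step_size_up = (2-0.1)/373): LB=0.851: the slope is 1.34 times steeper (equivalent to max_lr=2.65)

Keeping both slope and max_lr the same and using 4 cycles:

27 epochs, 4 cycles, 0.1-2, step_size_up=503: LB=0.846:

train_loss=0.0249, val_loss=0.0296: the gap from the LB score that the loss would suggest feels very large.

Reproducing LB=0.869:

1. Match the settings exactly:

2. train_loss and val_loss match to the displayed digits, and the output matches exactly as far as can be seen

3. LB=0.869 was reproduced.

Just changing max_lr from 2 to 1 lowered the LB score from 0.869 to 0.845, and just adding one cycle with the same slope and max_lr lowered it from 0.869 to 0.846.

I will try max_lr=2.25 and max_lr=1.75.

20 epochs, 3 cycles, max_lr=2.25 : LB=0.863 : (train_loss=0.0269, val_loss=0.0324)

20 epochs, 3 cycles, max_lr=1.75 : LB=0.869 : (train_loss=0.0266, val_loss=0.0352)

Added March 27: with 20 epochs and 3 cycles, I want to check a bit further whether max_lr below 2 is better.

With OneCycleLR I was using AdamW. With CyclicLR I cannot use AdamW, so I reluctantly use SGD(nesterov=True), but the results are good, so I am thinking of trying SGD again with OneCycleLR as well.

Number of epochs:

The tables in Leslie N. Smith's papers show that the score rises steadily as the number of epochs increases through 25, 50, 75, 100, and 150. That is presumably the trend when there is a lot of data. On the other hand, with little data, the papers seem to confirm experimentally that CyclicLR and OneCycleLR give good results in about 20 epochs.

Dice computation: borrowed from the following site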

https://github.com/UCAS-Vincent/UNet-3Plus/blob/master/metrics.py

import numpy as np
import torch

def dice_coef(output, target):
    # Soft Dice on sigmoid probabilities; smooth avoids division by zero
    smooth = 1e-5

    if torch.is_tensor(output):
        output = torch.sigmoid(output).data.cpu().numpy()
    if torch.is_tensor(target):
        target = target.data.cpu().numpy()
    #output = torch.sigmoid(output).view(-1).data.cpu().numpy()
    #target = target.view(-1).data.cpu().numpy()

    intersection = (output * target).sum()

    return (2. * intersection + smooth) / (output.sum() + target.sum() + smooth)

......

dices = []

for image, target in loader:

    ......

    dice = dice_coef(output, target)

    dices.append(dice.item())

np.array(dices).mean()

Checking that it works!!!

Tomorrow I will compare 30 epochs, 3 cycles, 0.1-3, with the rising/falling split of each cycle set to 30:70 and 70:30 (the OneCycleLR default is 30:70: pct_start=0.3). I also plan to look at base_lr below 0.1, and to try SGD with OneCycleLR.

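March 24 (Wed)

HuBMAP: 1,291 teams: 48 days to go : 0.869 ---> ???

remaining time of GPU: 7 h

30 epochs, 3 cycles, 0.1-3, 30:70: LB=0.857 : (train_loss=0.0253, val_loss=0.0368)

30 epochs, 3 cycles, 0.1-3, 50:50: LB=0.835 : (train_loss=0.0253, val_loss=0.0353)

30 epochs, 3 cycles, 0.1-3, 70:30: LB=0.834 : (train_loss=0.0270, val_loss=0.0310)

50:50 was reported earlier. All the CyclicLR results up to yesterday are 50:50.

The loss is the average over the final epoch, while the model's final parameters are, I think, the result of the final step (batch), so that alone means the correspondence must be judged carefully (if this reading is right). If dice_coefficient is also computed per epoch, it is just as likely as the loss to fall short as a measure of predictive performance. With the CyclicLR I am using now, the difference between the evaluation values (loss, acc) of the final epoch and of the final step (batch) tends to be large. (Or so it seems.)

Since 30:70 looked good, I took the condition with the highest LB score so far, changed only 50:50 to 30:70, ran it, and committed and submitted.

20 epochs, 3 cycles, 0.1-2, 30:70 : LB=0.849 : (train_loss=0.0252, val_loss=0.0301)

Although the loss is a per-epoch average and so unreliable, I still end up judging by it.

Looking at this loss when submitting, I was sure of a new personal record.

Seeing LB=0.849, I was disheartened.

I never thought it would drop by 0.02.

I would like to think it just does not suit the public_test_data, but I suppose it means poor generalization, that is, overfitting.

I will judge from the overall trend of the loss instead.

Assuming this case is overfitting, I will try a somewhat harsher transform as a remedy: strengthen the transform and keep everything else the same.

After the run, the final-epoch losses were train_loss=0.0266, val_loss=0.0372. The slightly larger train_loss is as expected, but val_loss seems a little too large, whether viewed as an average or compared with the value one epoch earlier.

The result was LB=0.875. A new personal record!

That is a jump of 0.026 from 30:70: LB=0.849.

Then what happens if this stronger transform is applied to 50:50 : LB=0.869? I will run it tomorrow.

If 50:50 is not overfitting, applying a stronger transform to it could lead to underfitting. Let's experiment!

GPU time left is 5 hours; 20 epochs take 35-40 minutes, 70-80 minutes with commit, so about 4 more conditions.

I moved from OneCycleLR to CyclicLR, but OneCycleLR appears to be one refined form of CyclicLR. Looking back at my results, both train_loss and val_loss are smaller with OneCycleLR(optimizer=AdamW).

I think the LB score with OneCycleLR plateaued at 0.86+ because my countermeasures against overfitting were insufficient.

So tomorrow I will run OneCycleLR(optimizer=AdamW) with the stronger transform.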

March 25 (Thu)

HuBMAP: 1,295 teams: 47 days to go : 0.875 ---> ???

remaining time of GPU: 3.5 h

The combination of CyclicLR and SGD worked reasonably well, so I will also try OneCycleLR with SGD.

I use, unchanged, the code that gave LB=0.865 with OneCycleLR and AdamW.

OneCycleLR(AdamW, max_lr=0.001) : LB=0.865 (train_loss=0.0238, val_loss=0.0311)

OneCycleLR(SGD, nesterov=True, max_lr=3) : LB=0.861 (train_loss=0.0252, val_loss=0.0286)

I will apply the stronger transform to AdamW, which has the smaller train_loss.

OneCycleLR(AdamW, max_lr=0.001) : LB=0.869 (train_loss=0.0251, val_loss=0.0322)

A gain of 0.004. Less than I had hoped.

I will also try the stronger transform on the run that used OneCycleLR with SGD.

OneCycleLR(SGD, nesterov=True, max_lr=3) : LB=0.862 (train_loss=0.0262, val_loss=0.0328)

A gain of 0.001. I hoped to exceed LB=0.875, but that did not happen.

For the next week (March 27 (Sat) to April 2 (Fri)), I will try fine tuning like the following.

⇒ OneCycleLR(SGD): change max_lr=3 to 2.5, 2.0, 1.5, 1.0, etc.

⇒ OneCycleLR(AdamW): change epochs=20 to 18, 22, 24, etc.

⇒ OneCycleLR(SGD), (AdamW): change pct_start=0.3 to 0.4, 0.5, etc.

⇒ CyclicLR(SGD): 3 cycles, epochs=3n (18, 21, 24, ... )

⇒ CyclicLR(SGD): with epochs=20, 3 cycles (step_size_up=297, step_size_down=696), max_lr=2, try mode="triangular2".

⇒ CyclicLR(SGD): try base_momentum=0.85, max_momentum=0.95.

⇒ Vary the parameters of augmentations such as Cutout and blur.

⇒ batch_size=16 --> 12 --> 8 --> 4 --> 2 -->1 ???

⇒ Image resolution: with EfficientNetB2, lower the batch size and raise the resolution. 256x256 --> 320x320 --> 384x384 --> 448x448 --> 512x512: also increase the overlap, 32 --> 48 --> 64 --> 96 --> 128

⇒ B2 --> B3 --> B4: only after B2 has been examined thoroughly!

⇒ Use more train_data.

The effect of augmentation depends on the degree of overfitting. With OneCycleLR, the number of epochs lets me steer, with some control, toward either overfitting or underfitting, so I plan to try this additionally on the setups tested today.

I had planned to try the stronger transform on 20 epochs, 3 cycles, 0.1-2, 50:50 : LB=0.869, but trying OneCycleLR (SGD) used up the GPU quota, so that will wait until Saturday or later.

CyclicLR has more freedom than OneCycleLR and has also given good values, so I will continue with it. CyclicLR has three modes: with "triangular" the learning rate at the triangle peaks does not change, while with "triangular2" the peak decays as 1, 1/2, 1/4. With "exp_range" the peak decays exponentially. The decay rate can be changed through gamma and the number of steps, but the resulting peak values have to be calculated from the given formula.

When one cycle is 300+700 steps, for up to 3 cycles:

gamma=0.999: the peak values of the cycles are 0.74, 0.27, 0.10.

gamma=0.9993: 0.81, 0.40, 0.20, a rate of change similar to "triangular2".

gamma=0.99965: 0.90, 0.63, 0.45 (1.0 : 0.7 : 0.5): tempting to try.
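These peak values follow from evaluating gamma**step at the step where each cycle peaks. A minimal sketch, with exp_range_peaks as an illustrative helper; as far as I understand the PyTorch implementation, this factor scales the amplitude (max_lr - base_lr), so the numbers are relative peak heights above base_lr.

```python
def exp_range_peaks(gamma, step_size_up, step_size_down, cycles=3):
    """Scale factor gamma**step at the peak step of each cycle."""
    cycle_len = step_size_up + step_size_down
    return [gamma ** (step_size_up + i * cycle_len) for i in range(cycles)]


for gamma in (0.999, 0.9993, 0.99965):
    peaks = exp_range_peaks(gamma, step_size_up=300, step_size_down=700)
    print(gamma, [round(p, 2) for p in peaks])
```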

Comparison of Cyclic and 1 Cycle:

Cyclic defaults: base_momentum=0.8, max_momentum=0.9

1 Cycle defaults: base_momentum=0.85, max_momentum=0.95

Cyclic base_lr: fixed, with no default; 0.1 seems common in the papers.

1 Cycle base_lr: variable; the initial value is max_lr/div_factor (default max_lr/25) and the final value is max_lr/div_factor/final_div_factor (default max_lr/div_factor/1e4). 1 Cycle thus seems to set both the initial and final learning rates lower than Cyclic. If that is better, Cyclic's base_lr should probably be lower than 0.1. In one experiment (a single example only), with 20 epochs, 2 cycles, max_lr=2, and base_lr=0.1, 0.001, 0.00001, I got LB=0.839, 0.853, 0.847.

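March 26 (Fri)

HuBMAP: 1,296 teams, 46 days to go : top 0.934

I want to understand augmentation properly.

Basic idea: the probability of applying an augmentation is below 100%, so that some fraction of original images passes through unchanged. Cropping and resizing, however, may be applied 100% of the time. With a lot of data there is less need for augmentation, so use probabilities of 10-30% and low intensity. With little data, apply each at 40-50% and raise the intensity.

It is called padding out the images, but does applying one operation with 50% probability double the data? If noise is added to 100 images with 50% probability, 50% get noise and 50% pass through unchanged. The total is still 100. Taken literally, padding out would mean keeping the originals, adding noise to half of them, and adding those to the originals for 150 images. Or adding noise to all 100 and adding them to the originals for 200. It is frustrating that I know nothing concrete about this.

As far as I can tell from how my code behaves, data is processed at the specified rates, and adding more kinds of processing reduces the share of original images accordingly, which makes me wonder whether this is all right.

In that case, would the best thing be, somewhat like autofocus, to preprocess every image so that the CNN can handle it easily? Or to build a model that handles every image correctly without preprocessing? What is wanted is the latter. It means training the model with the kinds of image variation likely to occur in real images.

import albumentations as A

transform = A.Compose([
    A.RandomCrop(512, 512),
    A.RandomBrightnessContrast(p=0.3),
    A.HorizontalFlip(p=0.5),
])

The three operations are listed from top to bottom and are applied in this order.

1. A 512x512 region is randomly cropped from the original image. This is done 100% of the time.

2. Of the images processed in step 1, 30% get a random brightness/contrast change and 70% pass through unprocessed.

3. Of the images leaving step 2, 50% are flipped horizontally and 50% pass through unchanged.

So the number of output images equals the number of input images: 35% are only cropped, 35% are cropped and flipped, 15% are cropped with only a brightness/contrast change, and 15% are cropped, brightness/contrast-changed, and flipped.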
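The percentages follow from multiplying the independent probabilities. The helper below (pipeline_outcomes, an illustrative name) enumerates every on/off combination for a set of transforms, which is handy once more than two probabilistic transforms are chained.

```python
from itertools import product


def pipeline_outcomes(probabilities):
    """Share of images receiving each on/off combination of independent transforms.

    probabilities: dict mapping transform name -> probability p of being applied.
    """
    names = list(probabilities)
    shares = {}
    for flags in product((False, True), repeat=len(names)):
        share = 1.0
        for name, applied in zip(names, flags):
            p = probabilities[name]
            share *= p if applied else (1.0 - p)
        applied_names = tuple(n for n, a in zip(names, flags) if a)
        shares[applied_names] = share
    return shares


shares = pipeline_outcomes({"brightness_contrast": 0.3, "hflip": 0.5})
for applied, share in shares.items():
    print(applied or ("crop only",), round(share, 2))
```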

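If a set of images processed this way is stored in memory, the images do not change from epoch to epoch during training. If augmentation could be done every epoch, each epoch would train on different images, which seems better, though computation time would probably go up accordingly.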
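Whether the images change every epoch depends on where the transform is called. If it is called inside the dataset's __getitem__ (a common PyTorch pattern), each access draws new random parameters, so every epoch sees a different variant without storing extra images. The toy sketch below shows this with a nested list as the "image" and a hand-written flip. OnTheFlyDataset and random_hflip are illustrative names, not part of albumentations or PyTorch; which approach a given notebook takes has to be checked in its Dataset class.

```python
import random


class OnTheFlyDataset:
    """Minimal dataset that applies the transform at access time, not up front."""

    def __init__(self, images, transform):
        self.images = images
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.transform(self.images[idx])


def random_hflip(p, rng):
    """Return a transform that mirrors each row with probability p."""
    def apply(image):
        if rng.random() < p:
            return [row[::-1] for row in image]
        return image
    return apply


dataset = OnTheFlyDataset([[[1, 2, 3]]], random_hflip(0.5, random.Random(0)))
variants = [dataset[0] for _epoch in range(10)]   # same index, read once per "epoch"
print(variants)
```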

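Controlling this precisely requires the skills to program from scratch, to read programs accurately, and to integrate code accurately, so I will read programs with that in mind.

I should decide the augmentation parameters by thinking about what processing this competition needs. For that, it may be necessary to visualize the model's predictions, understand concretely what mistakes it makes, and address them. I recall the person at the top of the LB saying that identifying glomeruli that are hard to segment and taking measures specific to those images raised the LB score.

March 27 (Sat) to April 2 (Fri): LB=0.875 ---> 0.88+

1. Explore around the model that gave 0.875 (epochs, augmentation, resolution, max_lr, ...)

2. Examine CyclicLR's "triangular2" and "exp_range" (focusing on 21 epochs or fewer)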

March 27 (Sat)

HuBMAP: 1,303 teams, 45 days to go : top 0.934

GPU : 42 h

Run 1 : based on the model of LB=0.875 (0.0266, 0.0372) : E20 B16(149) 1024,32,256 trfm_4 b2-Unet CyclicLR SGD(nesterov) lr=0.1-2 triangular cycle=3 298,696 momentum=0.8-0.9 BCE train_data=8+26d...+4ef... : lr=0.1-2 --> lr=0.001-2 : LB=0.867 (-0.008) (0.0271, 0.0347)

Run 2 : triangular --> triangular2 : LB=0.861 : (0.0258, 0.0375)

Run 3 : batch_size=16 --> 8 : normal through epoch 4, but from epoch 5 both train_loss and val_loss showed abnormal values; I waited about 5 epochs, saw no improvement at all, and stopped it.

When the batch size changes, the number of steps per epoch (data size ÷ batch size) changes, so the parameters must be recalculated; I found that the error came from neglecting this. The run was effectively set to 20 epochs, 6 cycles.

Run 4 : batch_size=16 --> 24 : LB=0.867 (0.0255, 0.0333)

I neglected the recalculation here too, so it ran as 20 epochs, 2 cycles.
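The 6-cycle and 2-cycle figures can be checked with a little arithmetic. In the sketch below, effective_cycles is a hypothetical helper and the tile count is an assumption (149 full batches of 16, matching the "B16(149)" note in Run 1); the step sizes 298 and 696 are the ones listed there.

```python
import math


def effective_cycles(n_samples, batch_size, epochs, step_size_up, step_size_down):
    """Number of learning-rate cycles actually completed for a given batch size."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * epochs / (step_size_up + step_size_down)


n_samples = 149 * 16   # assumed tile count: 149 full batches of 16
for batch_size in (16, 8, 24):
    cycles = effective_cycles(n_samples, batch_size, epochs=20,
                              step_size_up=298, step_size_down=696)
    print(batch_size, round(cycles, 2))
```

To keep 3 cycles after changing the batch size, step_size_up and step_size_down would be scaled by the same factor as the steps per epoch.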

Run 5 : momentum 0.80, 0.90 --> 0.85, 0.95 : LB=0.866 (0.0278, 0.0329)

Run 6 : image 256x256 --> 320x320 : LB=0.877 (0.0238, 0.0319)

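Only 0.002, but a new personal record.

The batch-size effect needs to be redone with the parameters set correctly.

For Run 1 I lowered base_lr a little, imitating the OneCycleLR settings, but it does not seem to have helped.

With Run 2's triangular2, the peak learning rate halves each cycle, so I thought it would converge more easily than with equal peaks; I suspect I think this way because the traditional approach of lowering lr stepwise or exponentially is stuck in my head.

Batch size seems to be case by case and I do not understand it well; I wanted to test it but got the parameter settings wrong, so I will redo it, perhaps tomorrow.

The momentum defaults differ between OneCycleLR and CyclicLR, so I tried the other one's defaults. It did not help this time, but I think there are situations where it will.

Raising the image resolution is effective, but it looks like it will be a fight against overfitting.

March 28 (Sun)

HuBMAP: 1,309 teams, 44 days to go : 0.057 to first place

Today I will increase the pixel count beyond 320x320, which proved effective yesterday.

B2-Unet, batch_size=16, 320x320, CyclicLR(SGD, epochs=20, up=298, down=696, 0.1-2.0, momentum=0.8, 0.9, triangular, ...) : LB=0.877 (0.0238, 0.0319)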

384x384 : LB=0.873 (0.0229, 0.0323)

448x448 : LB=0.860 (0.0212, 0.0303)

512x512 : LB=0.871 (0.0202, 0.0294)

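Both train_loss and val_loss are getting smaller, so the score ought to go up, but it is stalling. Since train_loss is on the high side, I suppose they are all overfitting.

Now let me change the decoder.

I will start with 512x512, partly to check whether each decoder can handle 512x512.

Linknet: 512x512 : LB=0.876 (0.0220, 0.0285)

PSPNet: did not converge.

FPN: convergence was very poor, so I stopped.

MAnet: 512x512 : LB=0.865 (0.0220, 0.0284)

PSPNet and FPN respond poorly. I want to find out why, but that is for another day.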

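March 29 (Mon)

HuBMAP: 1,313 teams, 43 days to go :

Linknet gave a better LB score than Unet and MAnet at 512x512, so I will test it at other resolutions too.

256x256 : (0.0301, 0.0356) : committed only

320x320 : skipped

384x384 : skipped

448x448 : (0.0224, 0.0291) : LB=0.862

Linknet's loss values at 256x256 and 448x448 are higher than Unet's. This suggests Linknet performs worse than Unet. At 512x512 as well, Linknet's loss is larger than Unet's, and I surmise that its LB score is nevertheless higher because Linknet's slightly worse performance held overfitting down a little. I will set Linknet aside for now.

Using a larger encoder for low-resolution images and a smaller encoder for high-resolution images might be one approach.

I changed the encoder from b2 to b3 and b4. A big problem then appeared.

Using b3-Unet at 320x320, as shown below, from the 3rd epoch of the 3rd cycle the loss grew by an order of magnitude and never converged again.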

1│ 0.1397│ 0.0703│ 1.70
2│ 0.0531│ 0.1602│ 1.67
3│ 0.0576│ 0.0428│ 1.67
4│ 0.0378│ 0.0418│ 1.67
5│ 0.0356│ 0.0394│ 1.66
6│ 0.0316│ 0.0344│ 1.67
7│ 0.0292│ 0.0327│ 1.66
8│ 0.0288│ 0.0353│ 1.67
9│ 0.0341│ 0.0370│ 1.67
10│ 0.0350│ 0.0405│ 1.67
11│ 0.0334│ 0.0339│ 1.67
12│ 0.0299│ 0.0304│ 1.67
13│ 0.0274│ 0.0355│ 1.67
14│ 0.0260│ 0.0322│ 1.67
15│ 0.0248│ 0.0337│ 1.66
16│ 0.0273│ 0.1100│ 1.68
17│ 0.3699│ 0.2800│ 1.68
18│ 0.2860│ 0.2863│ 1.67
19│ 0.2851│ 0.2800│ 1.67
20│ 0.2845│ 0.2840│ 1.67
21│ 0.2855│ 0.2821│ 1.69

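The columns are, from left: epoch, train_loss, val_loss, and computation time (min).

The 3rd cycle starts at epoch 15 and is set so that lr peaks partway through epoch 17. Everything looked fine through the 2nd cycle, so I did not expect this in the 3rd.

I knew that with 20 epochs and 3 cycles, setting max_lr to 3 or more produces abnormal values at the epoch where lr peaks, but this time the first 2 cycles were normal, so I was surprised.

I lowered max_lr from 2.0 to 1.5 and ran it with everything else unchanged.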

15│ 0.0243│ 0.0295│ 1.66
16│ 0.0266│ 0.0417│ 1.67
17│ 0.0300│ 0.0360│ 1.69
18│ 0.0269│ 0.0586│ 1.66
19│ 0.0262│ 0.0431│ 1.67
20│ 0.0241│ 0.0339│ 1.66
21│ 0.0228│ 0.0278│ 1.68

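No anomaly occurred, and it converged.

I learned that max_lr needs to be optimized for the encoder size and image resolution.

When I used EfficientNetB3 as the encoder on 256x256 images, I once felt that the final-epoch train_loss was a little large (poor convergence). I think the same anomaly was occurring then, in a milder form.

If so, the failure to converge in the 4th cycle when I increased to 4 cycles may have been the same phenomenon. That is, repeating cycles advances training, but applying the same max_lr as training advances not only loses the benefit of the large lr but can make the model worse or stop it from converging. Applying too large a learning rate to a well-trained model breaks a model that was nearly finished.

Presumably that is partly why "triangular2", which halves the peak every cycle, is provided as an option.

There is also the option "exp_range", which lets max_lr decay every cycle and lets the decay profile be changed.

3 cycles, 21 epochs, learning-rate base and max: 0.1-1.5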

256x256 : b3 : (0.0266, 0.0315) : LB=0.858

320x320 : b3 : (0.0228, 0.0278) : LB=0.868

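I hoped a larger encoder would lower val_loss and suppress overfitting, but having been forced to lower max_lr, perhaps Super-Convergence no longer works well.

I will adjust the augmentation as a countermeasure against overfitting.

I immediately found a mistake in the augmentation.

I use Cutout, and I had set the hole size in pixels as an absolute value, like 256 x 0.1. Then, when the image resolution doubles, the hole becomes half the relative size, or 1/4 in area. So the more pixels the image had, the weaker the augmentation was.

I will change the Cutout size to scale with the image size and look for suitable probability, size, count, and so on.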
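A minimal sketch of tying the hole size to the image size, and of how much weaker a fixed hole becomes at higher resolution. cutout_hole_size is a hypothetical helper; the value it returns would be passed to whatever size arguments the Cutout transform in use takes.

```python
def cutout_hole_size(image_size, side_fraction=0.1):
    """Hole side in pixels, scaled with the image side instead of fixed."""
    return max(1, round(image_size * side_fraction))


fixed_hole = cutout_hole_size(256)            # the size tuned at 256x256
for size in (256, 320, 384, 448, 512):
    scaled = cutout_hole_size(size)
    fixed_area_share = (fixed_hole / size) ** 2
    scaled_area_share = (scaled / size) ** 2
    print(size, scaled, round(fixed_area_share, 4), round(scaled_area_share, 4))
```

With the fixed hole, the masked share of the image at 512x512 is a quarter of what it is at 256x256; with the scaled hole it stays roughly constant.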

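In Fig. 1 of the following paper, a single square of 1/4 the image area is masked, it may extend past the border, and it is applied 100% of the time. Variants without overhang at 50%, and with different sizes, also seem to have been tried. In Kaggle competitions I mostly see smaller holes in larger numbers.

[Figures from the paper; images not reproduced]

I think the only way is to experiment with various patterns without preconceptions. The borrowed code uses 10 small squares.

One hole of 1/4 area: 1/2 x 1/2 : probability set to 25%.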

1│ 0.1415│ 0.0802│ 2.08
2│ 0.0614│ 0.1015│ 2.05
3│ 0.0744│ 0.2674│ 2.05
4│ 0.0598│ 0.0492│ 2.06
5│ 0.0479│ 0.0665│ 2.04
6│ 0.0438│ 0.0423│ 2.03
7│ 0.0404│ 0.0439│ 2.04
8│ 0.0396│ 0.0905│ 2.05
9│ 0.0472│ 0.0615│ 2.05
10│ 0.0490│ 0.0418│ 2.06
11│ 0.0437│ 0.0417│ 2.07
12│ 0.0396│ 0.0459│ 2.04
13│ 0.0367│ 0.0365│ 2.05
14│ 0.0352│ 0.0389│ 2.04
15│ 0.0365│ 0.0804│ 2.06
16│ 0.0377│ 0.0459│ 2.07
17│ 0.0406│ 0.1307│ 2.09
18│ 0.0411│ 0.0389│ 2.07
19│ 0.0370│ 0.0420│ 2.06
20│ 0.0329│ 0.0397│ 2.07
21│ 0.0340│ 0.0365│ 2.08

Unusable unless it gets to 0.03 or below.

One hole of 1/10 area ≒ 0.316 x 0.316: probability set to 25%.

1│ 0.1341│ 0.0948│ 1.91
2│ 0.0531│ 0.0647│ 1.88
3│ 0.0431│ 0.0494│ 1.89
4│ 0.0389│ 0.0364│ 1.86
5│ 0.0358│ 0.0354│ 1.87
6│ 0.0316│ 0.0367│ 1.88
7│ 0.0296│ 0.0295│ 1.88
8│ 0.0289│ 0.0412│ 1.87
9│ 0.0307│ 0.0409│ 1.86
10│ 0.0335│ 0.0333│ 1.87
11│ 0.0315│ 0.0369│ 1.88
12│ 0.0298│ 0.0325│ 1.86
13│ 0.0271│ 0.0337│ 1.89
14│ 0.0268│ 0.0330│ 1.87
15│ 0.0258│ 0.0397│ 1.88
16│ 0.0282│ 0.0364│ 1.87
17│ 0.0296│ 0.0818│ 1.88
18│ 0.0323│ 0.0323│ 1.87
19│ 0.0274│ 0.0340│ 1.87
20│ 0.0250│ 0.0366│ 1.88
21│ 0.0256│ 0.0295│ 1.88

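LB=0.874: whether A.Cutout helped is still unknown. I need the result for p=0.

The loss trajectory in the 3rd cycle looks slightly unnatural. As I wrote before, with three peaks of equal height, convergence can worsen or the model can break at the third peak, after training has progressed to some extent. Even when things look fine, convergence may be getting worse.

However, I am relying on Super-Convergence, which comes from the peak height, so I cannot simply lower everything.

With 21 epochs, 3 cycles, and batch_size=16, using "exp_range" with gamma=0.9999 gives three peaks of 0.969, 0.873, 0.787. I will try this kind of setting.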

gamma=1.0 (=triangular) 1.0, 1.0, 1.0 : LB=0.874 (0.0256, 0.0295)

gamma=0.9999 : 0.969, 0.873, 0.787 : LB=0.863 (0.0256, 0.0294)

gamma=0.9998 : 0.939, 0.762, 0.619 : LB=0.852 (0.0248, 0.0312)

A trend was observed: decaying the peak learning rate lowers the LB score.

April 1 (Thu)

GPU : no time left

At last I can display val_dice_coefficient.

April 2 (Fri)

GPU : no time left

Let me look at the results of a partial run on CPU (stopped at epoch 16).

Set to give 3 cycles in 21 epochs (7 epochs per cycle).

CyclicLR(SGD, lr : 0.1-2.0), loss_fn=BCE, EfficientNetB2-Unet

From left: epoch, train_loss, val_loss, train_dice, val_dice, computation time (min)

1│ 0.1181│ 0.0710│ 0.6080│ 0.7867│ 26.24
2│ 0.0676│ 0.4731│ 0.7739│ 0.3525│ 26.23
3│ 0.0789│ 0.0875│ 0.7445│ 0.8000│ 26.10
4│ 0.0563│ 0.2749│ 0.8181│ 0.4594│ 25.97
5│ 0.0479│ 0.0423│ 0.8418│ 0.8608│ 25.97
6│ 0.0398│ 0.0418│ 0.8616│ 0.8545│ 25.82
7│ 0.0356│ 0.0334│ 0.8742│ 0.8781│ 25.85
8│ 0.0358│ 0.0420│ 0.8740│ 0.8638│ 26.04
9│ 0.0434│ 0.0935│ 0.8593│ 0.7237│ 26.04
10│ 0.0495│ 0.0421│ 0.8410│ 0.8683│ 26.10
11│ 0.0426│ 0.0381│ 0.8549│ 0.8719│ 26.16
12│ 0.0361│ 0.0386│ 0.8747│ 0.8560│ 26.08
13│ 0.0339│ 0.0341│ 0.8789│ 0.8730│ 25.93
14│ 0.0317│ 0.0334│ 0.8867│ 0.8852│ 26.07
15│ 0.0313│ 0.0363│ 0.8873│ 0.8756│ 26.31
16│ 0.0362│ 0.0432│ 0.8784│ 0.8691│ 26.11

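val_dice peaks at the last epoch of each cycle.

The score at the last epoch of the 2nd cycle is the higher one. At present I commit and submit after three cycles.

Now, let me think about how to beat the LB=0.877 of March 28.

What I have learned so far (nothing is certain; everything is relative):

・train_data: adding 2 files to the original 8 for a total of 10 was better than the 8 alone. I would like to use all 15, but with the current approach I hit the memory limit and cannot add even one more. There must be a solution, and I want to find it.

・augmentation: geometric deformation that is too strong is counterproductive. Appropriate photometric processing without shape change seems good. Cutout also seems to help, but needs examination.

・LB=0.87+ has come only from CyclicLR(SGD), with 3 cycles over 20 or 21 epochs and an lr range of 0.1-2.0. I do not think this is the best.

・Image resolution: I have been examining it for about a week, but the runs are overfitting.

・It is frustrating that I cannot beat EfficientNetB2-Unet. I briefly tried B3 and B4, FPN, MAnet, Linknet, and so on, but each needs its own tuning and I feel it could get out of hand.

・I am very curious about the relation between the Dice coefficient, which I can finally compute, and the LB score. In the discussion, CV and LB are said to be close; 0.932:0.916, 0.94:0.932, 0.923:0.924, etc. are reported.

・Using CyclicLR brought the LB score to 0.87+.

・CyclicLR cycle count: the paper said 3 or more cycles is good, but going from 3 cycles (v28 : LB=0.869 : 0.0267, 0.0314) to 4 cycles lowered the score (v36: LB=0.851 : 0.0255, 0.0306), so at that point (March 23) I stopped increasing the cycle count. Although train_loss and val_loss at the final epoch were smaller with 4 cycles, the LB score was worse, and I attributed that to the abnormal loss values (0.0282, 0.1107) partway through the 4th cycle. But that judgment may have been hasty.

I compare the two data sets that led me to drop 4 cycles.

Data for LB=0.869: set to 3 cycles in 20 epochs, so the ends of cycle 1 and cycle 2 do not fall on epoch boundaries. (Cycles are defined in batch units (steps).)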

1│ 0.1264│ 0.0543│ 1.05
2│ 0.0544│ 0.0593│ 1.03
3│ 0.0446│ 0.0572│ 1.03
4│ 0.0446│ 0.0766│ 1.03
5│ 0.0481│ 0.0416│ 1.03
6│ 0.0390│ 0.0375│ 1.03
7│ 0.0329│ 0.0326│ 1.03
8│ 0.0327│ 0.0378│ 1.03
9│ 0.0337│ 0.0335│ 1.03
10│ 0.0343│ 0.0392│ 1.03
11│ 0.0339│ 0.0386│ 1.03
12│ 0.0306│ 0.0379│ 1.03
13│ 0.0289│ 0.0330│ 1.03
14│ 0.0270│ 0.0373│ 1.03
15│ 0.0274│ 0.0353│ 1.04
16│ 0.0278│ 0.0337│ 1.02
17│ 0.0324│ 0.0978│ 1.03
18│ 0.0348│ 0.0370│ 1.03
19│ 0.0282│ 0.0316│ 1.03
20│ 0.0267│ 0.0314│ 1.03

Data for LB=0.851: this run is set to 4 cycles in 20 epochs, so the cycles end exactly at epochs 5, 10, 15, and 20.

1│ 0.1229│ 0.0606│ 1.10
2│ 0.0552│ 0.0577│ 1.09
3│ 0.0441│ 0.0484│ 1.07
4│ 0.0399│ 0.0361│ 1.10
5│ 0.0346│ 0.0315│ 1.12
6│ 0.0319│ 0.0384│ 1.09
7│ 0.0334│ 0.0421│ 1.07
8│ 0.0351│ 0.0416│ 1.08
9│ 0.0335│ 0.0314│ 1.11
10│ 0.0307│ 0.0314│ 1.09
11│ 0.0281│ 0.0378│ 1.09
12│ 0.0294│ 0.0362│ 1.08
13│ 0.0315│ 0.0348│ 1.12
14│ 0.0297│ 0.0372│ 1.16
15│ 0.0268│ 0.0349│ 1.12
16│ 0.0258│ 0.0286│ 1.11
17│ 0.0282│ 0.1107│ 1.00
18│ 0.0323│ 0.0368│ 1.13
19│ 0.0277│ 0.0321│ 1.09
20│ 0.0255│ 0.0306│ 1.10

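At that time, I judged that 4 cycles were no good because of the low LB score and the val_loss of 0.1107 at epoch 17, but the 3-cycle run also has a spike, 0.0978, at epoch 17.
Since going to 4 cycles did lower the loss, I now think the lower LB score had some other cause, so I will try increasing the number of cycles.
To compare 3 and 4 cycles properly, the number of epochs per cycle must be the same. How many epochs to put in one cycle has not been optimized, so this needs to be part of the experiments as well. The position of the learning-rate peak within a cycle also needs optimizing: in these two runs it was at the center of the cycle, but since v43 I have set the step counts (step_size_up, step_size_down) so that the peak falls at 0.3 (3:7), the OneCycleLR default.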
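To make the effect of step_size_up and step_size_down concrete, here is a minimal sketch of an asymmetric triangular schedule in plain Python. It is written from scratch for illustration and is not the PyTorch CyclicLR code; the function name `cyclic_lr` and the figure of 100 batch steps per epoch are assumptions.

```python
def cyclic_lr(step, base_lr, max_lr, step_size_up, step_size_down):
    """Learning rate at a given batch step for an asymmetric triangular cycle."""
    cycle_len = step_size_up + step_size_down
    pos = step % cycle_len  # position inside the current cycle
    if pos <= step_size_up:
        frac = pos / step_size_up  # linear rise from base_lr to max_lr
    else:
        frac = 1.0 - (pos - step_size_up) / step_size_down  # linear fall
    return base_lr + (max_lr - base_lr) * frac


# Hypothetical numbers: 100 batch steps per epoch, 10 epochs per cycle, 3:7 split.
steps_per_epoch = 100
up, down = 3 * steps_per_epoch, 7 * steps_per_epoch
schedule = [cyclic_lr(s, 0.1, 2.0, up, down) for s in range(up + down + 1)]

# Start of the cycle, the peak, and the start of the next cycle.
print(schedule[0], schedule[up], schedule[-1])
```

Because the cycle is defined in batch steps, the peak lands on an epoch boundary only when step_size_up is a whole multiple of the steps per epoch.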

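April 3 (Sat)

GPU : 36 h (5.14 h/day)

Run 1: Check the Dice coefficient during the run that gave LB=0.875 (3 cycles in 20 epochs), and examine how training progresses beyond 3 cycles (set to 6 cycles in 40 epochs; if Dice_coeff at epoch 40 is good, commit and submit).

At epoch 20, val_dice_coeff was 0.888. I find it plausible that this corresponds to LB=0.875. At epoch 40, val_dice_coeff reached 0.9011, so let me commit and submit. If the result is good, I will try 60 or 80 epochs. The result was LB=0.799. With that, I no longer know what the next move should be.

Arbitrarily casting Cutout as the culprit, I will examine its influence. To do so, I compare Cutout probabilities of 0.25 and 0.0.

Cutout=0.25 : val_dice_coeff=0.895 : LB=0.862

Cutout=0.0 : val_dice_coeff=0.903 : LB=0.850

So perhaps Cutout had the effect of suppressing overfitting.

Still, it is troubling that val_dice_coefficient is of no use at all.

Tomorrow I will stop at an earlier cycle and get the LB score. Even so, I feel I can hardly expect any progress over last week.

I expected that computing val_dice_coefficient would speed up the search for good conditions, but this gets me nowhere. Is it useless as an indicator, or is there a problem in my code? It is computed by the same procedure as val_loss and reported alongside it, so it should reflect the validation results.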
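One way to rule out a coding problem is to compare against a reference calculation on tiny masks where the answer can be worked out by hand. The following is a minimal sketch, assuming flat binary masks already thresholded to 0/1; the function name `dice_coefficient` and the `eps` smoothing term are illustrative and not taken from the notebook.

```python
def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|A∩B| / (|A| + |B|) for two flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    # eps keeps the ratio defined (and equal to 1) when both masks are empty
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
# 2 overlapping pixels, 3 + 3 foreground pixels: 2*2 / 6
print(dice_coefficient(pred, target))
```

Two things worth checking with such a reference are whether the value is averaged per batch or computed over the whole validation set (the two generally differ), and whether the validation tiles resemble the test images closely enough for the number to track the LB score.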

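Today was the most disappointing day so far.

First, to look beyond epoch 20 for the model that gave LB=0.875 with 3 cycles in 20 epochs, I ran 6 cycles over 40 epochs and committed and submitted after epoch 40, which gave LB=0.799. val_Dice was 0.888 at epoch 20 and rose to 0.9011 at epoch 40; since 0.888 at epoch 20 gave LB=0.875, I expected about LB=0.885 from 0.901 at epoch 40, but got LB=0.799. A very disappointing result.

With 3 cycles in 20 epochs, each cycle lasts 6.666 epochs, with a 2-epoch rise (and hence a fall of about 4.67 epochs). The minimum and maximum learning rates then occur partway through an epoch, so I changed to a 2-epoch rise and a 4-epoch fall. The cycle now repeats every 6 epochs, which is convenient for, say, running 5 or 6 cycles, picking a promising number of cycles, and re-running.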
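The point about cycle ends falling inside an epoch can be checked with a few lines of arithmetic. This is a small sketch assuming evenly spaced cycles; `cycle_ends` is a hypothetical helper name.

```python
def cycle_ends(total_epochs, n_cycles):
    """Epoch positions at which each cycle ends when cycles are evenly spaced."""
    cycle_len = total_epochs / n_cycles
    return [round(cycle_len * (i + 1), 3) for i in range(n_cycles)]


print(cycle_ends(20, 3))  # cycle ends fall inside epochs 7 and 14
print(cycle_ends(18, 3))  # 6-epoch cycles end on epoch boundaries
print(cycle_ends(30, 5))
```

With a whole number of epochs per cycle, the model saved at the end of an epoch is also the model at the minimum learning rate, which is what makes stopping at an earlier cycle straightforward.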

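So I ran 5 cycles over 30 epochs with augmentation at three strengths. From weakest to strongest augmentation, the LB scores were 0.850, 0.862, and 0.847. Again, unremarkable values. The corresponding val_Dice values were 0.903, 0.895, and 0.896.

In other words, training the model that scored 0.875 at 3 cycles on to 6 cycles made its performance plummet. I do not think overfitting alone explains this.

In the example with three augmentation strengths, perhaps 0.850 reflects overfitting, 0.862 reflects improved predictive performance or suppressed overfitting thanks to augmentation, and 0.847 reflects excessive augmentation. I do not understand what is going on, so I cannot see a solution.

Perhaps the augmentation was inappropriate but happened to suppress overfitting and raise the LB score a little, whereas strengthening it, being an inappropriate operation, lowered predictive performance. What I changed this time was Cutout: (1) not used; (2) 10 squares, each with an area of 1% of the image, probability 25%; (3) the same with 2%, probability 25%. Figures like 1% or 2% sound small as one-dimensional quantities, but as two-dimensional areas they look large to the human eye.

LB=0.875 includes Cutout (25%) and Affine (12.5%). Up to 3 cycles, training progressed to a reasonable degree and moderate augmentation improved predictive performance. Could it be that, during the roughly 3 further cycles, the model learned the deformations introduced by geometric image operations such as Cutout and Affine and began to overreact to artifacts that arise because the input images are slices of larger images?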

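April 4 (Sun)

GPU : 28 h

Today I examine 5 cycles over 30 epochs with (1) no Cutout, (2) 0.1, and (3) 0.1414. The LB scores of the models trained to epoch 30 were (1) 0.850, (2) 0.862, and (3) 0.847. My guess is that overtraining cost them generalization. To check whether overtraining is the cause, I can evaluate, via the LB score, models stopped at cycles earlier than the fifth.

I will start with (1) no Cutout, checking the predictions of the model after 1 cycle (6 epochs). Looking at the loss, I would not feel the urge to compute or investigate anything, but looking at the val_Dice values makes me wonder whether they are really right.

The results for 6 and 12 epochs are in. They are dreadful. I almost think I was better off looking only at train_loss and val_loss.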

1│ 0.1260│ 0.0678│ 0.5811│ 0.8120│
2│ 0.0554│ 0.0500│ 0.8127│ 0.8475│
3│ 0.0444│ 0.0867│ 0.8497│ 0.7538│
4│ 0.0438│ 0.0386│ 0.8480│ 0.8574│
5│ 0.0387│ 0.0511│ 0.8690│ 0.7533│
6│ 0.0376│ 0.0347│ 0.8624│ 0.8859│ LB=0.850

..........

12│ 0.0286│ 0.0332│ 0.8988│ 0.8953│ LB=0.855

..........

18│ 0.0272│ 0.0322│ 0.9032│ 0.8987│ LB=0.852

..........

24│ 0.0243│ 0.0322│ 0.9125│ 0.9031│

..........

30│ 0.0235│ 0.0398│ 0.9144│ 0.9026│ LB=0.850

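That is all for (1) no Cutout. Next I will run 12 and 18 epochs for (2) Cutout : 0.1. After that I will review the Dice coefficient calculation.

(2) Cutout : 0.1 (the difference from the LB=0.875 model is the number of epochs per cycle: the LB=0.875 model used 3 cycles in 20 epochs, whereas this one uses 3 cycles in 18 epochs).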

12│ 0.0300│ 0.0363│ 0.8931│ 0.8842│ LB=0.859

..........

18│ 0.0286│ 0.0305│ 0.8987│ 0.8930│ LB=0.859

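Adding Cutout : 0.1 increases train_loss slightly. train_loss is one of the most reproducible indicators of a parameter change. val_loss, by contrast, varies widely because the validation samples are few and biased.

I could not raise the LB score, but I think this work is necessary for optimizing the number of cycles.

*** While modifying the Dice coefficient code, I also broke the inference code, so I will go back to the LB=0.877 code (version52/87) and restart. I will forget about the Dice coefficient for a while.

The LB scores obtained over these two days are the following 10: 0.799, 0.862, 0.850, 0.847, 0.832, 0.850, 0.855, 0.852, 0.859, 0.859.

The plan for April 3 to April 9 was to examine and organize the correlations among loss, Dice coefficient, and LB score, and use them to decide how to proceed.

What I found in these two days is that my Dice coefficient calculation seems to be failing. I had been choosing the cycle settings, in particular the number of epochs per cycle, using the Dice coefficient as a guide. The good results had come from 3 cycles in 20 epochs, where the number of epochs per cycle is not an integer, so this time I examined 3 cycles in 18 epochs (5 cycles in 30 epochs), computed the Dice coefficient, and tried to use it as an indicator of the relationship between the number of cycles and the LB score. The results were:

・I can now compute a Dice coefficient, and the values look plausible, but they correspond poorly to the LB score.

・Seeking performance equal to or better than 3 cycles in 20 epochs, I examined 3 cycles in 18 epochs, i.e., 6 epochs per cycle. The LB score was 0.859, which is 0.016 lower than the 0.875 of 3 cycles in 20 epochs. Considering also that 5 cycles in 30 epochs gave 0.862, it appears that, under the current conditions, 6 epochs per cycle gives a lower LB score than 6.67 epochs per cycle (3 cycles in 20 epochs). Presumably, with 6 epochs per cycle the down phase is shorter, so less learning takes place and the loss falls less.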
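To put a number on how poorly val_dice tracks the LB score, a Pearson correlation over the pairs quoted in the April 3 entries is a quick check. This is a sketch in plain Python; `pearson` is an illustrative helper, and five points are far too few for the value to be more than a rough indication.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# (val_dice, LB score) pairs taken from the April 3 entries of this diary
pairs = [(0.888, 0.875), (0.9011, 0.799), (0.903, 0.850),
         (0.895, 0.862), (0.896, 0.847)]
r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
print(f"Pearson r = {r:.2f}")
```

For these five pairs the correlation comes out negative: higher val_dice went with lower LB score, which is consistent with the impression that val_dice, as currently computed, is not a usable selection criterion.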

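Plan for tomorrow onward:

1. Try 7 epochs per cycle (up: 2 epochs, down: 5 epochs) and 8 epochs per cycle (up: 2 epochs, down: 6 epochs).

2. Review the Cutout conditions and the number of epochs per cycle: with 320-pixel images, the LB scores were 0.877 and 0.862. The two runs differ in two respects. The 0.877 run used 3 cycles in 20 epochs with 10 Cutout holes of 0.1x256; the 0.862 run used 3 cycles in 21 epochs (7 epochs per cycle) with 10 holes of 0.1x320. Between 0.1x256 and 0.1x320, the hole area differs by a factor of 1.56.

3. Review the Cutout conditions: many holes or a single hole, and the hole size.

With these in mind, I will focus on 512 pixels.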

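April 5 (Mon)

HuBMAP: 1,374 teams, 36 days to go

Today I will start with a single CyclicLR cycle.

1 cycle, 7 epochs (up: 2 epochs, down: 5 epochs): LB=0.859 (0.0255, 0.0292)

I do not want to see LB scores in this range any more. Let me increase the number of epochs per cycle, and also lengthen the rise from 2 to 3 epochs. Since the OneCycleLR default puts the rise at 30% of the cycle, I will use that value here too. That gives a 3-epoch rise and a 7-epoch fall.

1 cycle, 10 epochs (up: 3 epochs, down: 7 epochs): LB=0.880 (0.0235, 0.0287)

I have finally exceeded 0.877 (320 pixels), but since 256 pixels already gave 0.875, getting 0.880 at 512 pixels is much the same as having achieved nothing yet.

Let me try 2 cycles in 20 epochs.

2 cycles, 20 epochs (up: 3 epochs, down: 7 epochs) x 2 : LB=0.870 (0.0203, 0.277)

train_loss is 0.0203, so the model is overfitting.

From this state, it should be possible to bring the LB score to 0.89+ by setting the Cutout conditions properly. I will work on it with that belief. I will also look at color and noise.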

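April 6 (Tue)

*** Once again the editor froze during text conversion and the formatting got scrambled. Even with the restore button the formatting stays scrambled, so it is completely unusable. A pointless restore button! How many times is this now? There have also been times when nothing happened (something probably did, and I just did not notice) and the formatting was broken the next time I opened the document. ***

I had spent about an hour and a half planning GPU usage, but it was lost. Some of it remains in my head, so I will write it down.

Data to improve on: 2 cycles, 20 epochs (up: 3 epochs, down: 7 epochs) x 2 : LB=0.870 (0.0203, 0.277)

Task 1: Cutout: adjust hole_num and hole area. Base: hole_num=10, hole_size=1/10 (hole_area=0.01). Applied to 2 cycles in 20 epochs, the goal is a train_loss of about 0.022 to 0.023 and LB=0.88+.

Task 2: With the base Cutout conditions, bring train_loss to the 0.018 level. I do not know how many cycles and epochs are optimal. Searching conditions at 512 pixels takes time, so let me establish the basics at 256 pixels.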

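Experiments: 512 pixel

1. Check whether train_loss falls to 0.02 or below with 3 cycles in 30 epochs.

2. Run the 2-cycle, 20-epoch condition with hole_num=1, hole_area=0.1 (hole_size=0.316).

Preliminary experiment: 256 pixel

Use 256 pixels to check whether 3 cycles in 30 epochs at 512 pixels can give the desired result (train_loss of 0.018 or below).

At 256 pixels, I have no 1-cycle, 10-epoch run; I have a 2-cycle, 20-epoch run, but with 5:5 rather than 3:7. The 3-cycle, 30-epoch run is 3:7, but with 0.1-3.0 rather than 0.1-2.0. With lr=0.1-3.0, train_loss was 10E:0.0281, 20E:0.0255, 29E:0.0243, 30E:0.0253, and the result was LB=0.857 (trfm_1: weak). There is little useful data at 256 pixels, so I would be starting from scratch. The computation takes about one third of the time needed at 512 pixels, but collecting baseline data and exploring conditions would still take considerable time.

So I will drop the preliminary experiment at 256 pixels.

First, shall I check the result of 3 cycles in 30 epochs at 512 pixels?

How about 2 cycles in 30 epochs? Keep up at 3 epochs and lengthen down from 7 to 12 epochs; if train_loss falls to about 0.021 by epoch 15, can I hope for about 0.0185 in the second cycle?

That means I need to check what 1 cycle of 15 epochs gives.

1 cycle, 15 epochs (up:3, down:12) : (0.0208, 0.0291) : LB=0.867

train_loss came out slightly below 0.021, so it seems reasonable to expect about 0.0185 after 2 cycles.

For 384-pixel images, I already have results for hole_num=1, hole_area=0.1 (hole_size=0.316) and for hole_num=10, hole_size=0.1 (hole_area=0.01); the LB score was the same with either, so I will apply the hole_num=1 condition with a larger hole.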

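I seem to recall the paper saying that hole_area=0.25 gives good results, so I imitated this and applied it to 1 cycle of 15 epochs.

1 cycle, 15 epochs & hole_num=1, hole_area=0.25 (hole_size=0.5) : (0.0293, 0.0347) : LB=0.878

train_loss rose from 0.0208 to 0.0293. I was thinking this was no good, that the hole area was too large and I should reduce it a bit, but the LB score turned out not so bad. It looks usable! If I increase the amount of training with this, LB=0.89+ may be within reach.

I repeated the cycle. (I am now realizing for the first time that CyclicLR means repeating the same training.)

2 cycles, 30 epochs & hole_num=1, hole_area=0.25 (hole_size=0.5) : (0.0271, 0.0406) :

train_loss at epoch 30 is smaller than at epoch 15, about as much as I expected, so repeating the training had an effect; but val_loss at epoch 30 is large, so I am committing with considerable worry about whether the LB score will rise.

The result was LB=0.859. The hole seems to have been too large. Next I would like to make it a little smaller, but I have only six and a half hours of GPU time left, and there are things I want to examine first in the 1-cycle, 10-epoch series, so this will continue on or after this Saturday.

The training log looked like this.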

1│ 0.1490│ 0.1265│ 2.98
2│ 0.0597│ 0.0598│ 2.93
3│ 0.0500│ 0.0583│ 2.94
4│ 0.0465│ 0.0400│ 2.95
5│ 0.0451│ 0.0450│ 2.94
6│ 0.0392│ 0.0489│ 2.95
7│ 0.0383│ 0.0482│ 2.95
8│ 0.0366│ 0.0410│ 2.94
9│ 0.0358│ 0.0388│ 2.94
10│ 0.0329│ 0.0441│ 2.96
11│ 0.0336│ 0.0362│ 2.96
12│ 0.0314│ 0.0415│ 2.98
13│ 0.0311│ 0.0444│ 2.97
14│ 0.0308│ 0.0376│ 2.95
15│ 0.0293│ 0.0347│ 3.00 : LB=0.878
16│ 0.0289│ 0.0384│ 2.95
17│ 0.0320│ 0.0620│ 2.97
18│ 0.0354│ 0.0527│ 2.97
19│ 0.0360│ 0.0528│ 2.98
20│ 0.0339│ 0.0427│ 3.00
21│ 0.0319│ 0.0343│ 2.97
22│ 0.0315│ 0.0400│ 2.96
23│ 0.0332│ 0.0341│ 2.97
24│ 0.0304│ 0.0471│ 2.98
25│ 0.0301│ 0.0453│ 2.97
26│ 0.0299│ 0.0352│ 2.98
27│ 0.0302│ 0.0404│ 2.96
28│ 0.0320│ 0.0360│ 2.98
29│ 0.0283│ 0.0365│ 2.96
30│ 0.0271│ 0.0406│ 2.95 : LB=0.859

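From this result, it seems that using a hole_area of 0.20 or 0.15 instead of 0.25 could exceed 0.880 at epoch 15.

One cycle of 15 epochs seems to have been too long as the repeating unit for 2 cycles. What is the optimal number of epochs per unit? 6 seems to have been too short. 7 was effective. Beyond that I have not examined systematically; I skipped ahead and tried 10 and 15. An up:down ratio of 3:7 works well; with a 2-epoch up phase that corresponds to a cycle of about 7 epochs, with 3 epochs to a 10-epoch cycle, and with 4 epochs to a cycle of about 13 epochs. The 15-epoch cycle above has a 3-epoch up phase, so its up:down ratio is 2:8. The 3:7 ratio is the OneCycleLR default, not a CyclicLR default; the CyclicLR default is 5:5. I changed the CyclicLR ratio from 5:5 to 3:7 simply because, while tuning CyclicLR, I remembered the OneCycleLR default and thought I would give it a try. I do not know which ratio is best; the only way is to find out experimentally. Coming back to the point, a 15-epoch cycle with 3 epochs up and 12 down may not be a bad model on its own. It is just that, when repeated as is to make 2 cycles in 30 epochs, it did not give a good model this time.

This applies not only to schedulers: if you are not satisfied, you can rewrite it yourself. At first you use the off-the-shelf version with its defaults. As you get used to it, you start looking for good combinations of its internal parameters. Further on, you want to try combinations that the off-the-shelf version cannot do. At that point, I would like to be able to go back to the original code, rewrite it myself, and make it do what the stock version could not.

There are CyclicLR parameters I have not used. Besides the modes {triangular, triangular2, exp_range}, you can supply your own function.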

scale_fn (function) – Custom scaling policy defined by a single argument lambda function, where 0 <= scale_fn(x) <= 1 for all x >= 0. If specified, then ‘mode’ is ignored. Default: None

A simple example function is shown at the following site. There is still more I can do with the current setup.

https://github.com/keras-team/keras-contrib/blob/master/keras_contrib/callbacks/cyclical_learning_rate.py

clr_fn = lambda x: 0.5*(1+np.sin(x*np.pi/2.))
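To see what this scaling function would do, here is a small sketch that evaluates it with the standard library instead of NumPy. It assumes x is the cycle index (the per-cycle scaling mode); the helper `scaled_lr` is hypothetical and only illustrates how a scale factor in [0, 1] shrinks the triangular amplitude.

```python
import math


def clr_fn(x):
    """Same formula as the lambda above, using math instead of NumPy."""
    return 0.5 * (1 + math.sin(x * math.pi / 2.0))


def scaled_lr(base_lr, max_lr, tri, x):
    """tri is the triangular position in [0, 1]; clr_fn(x) scales the amplitude."""
    return base_lr + (max_lr - base_lr) * tri * clr_fn(x)


# Scale factor for cycle indices 0..4
values = [round(clr_fn(x), 3) for x in range(5)]
print(values)

# Peak learning rate (tri = 1) per cycle index for an lr range of 0.1-2.0
print([round(scaled_lr(0.1, 2.0, 1.0, x), 3) for x in range(5)])
```

The factor stays within the 0 to 1 range required of scale_fn, and it oscillates from cycle to cycle, so the peak learning rate would not be the same in every cycle.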

April 7 (Wed)

HuBMAP: 1,397 teams, 34 days to go : me:0.880 ---> top:0.939

GPU : 6.5 h

For 2 cycles in 20 epochs (train_loss=0.0203 : LB=0.870), apply hole_num=1, hole_area=0.25 (hole_size=0.5) to 3 cycles in 30 epochs.

That was the plan, but since I failed to improve the LB score with 2 cycles in 30 epochs, for this 1-cycle-of-10-epochs series I will aim to exceed 0.880 with 2 cycles in 20 epochs.

Area fraction and hole size: 0.1≈0.316x0.316 : 0.15≈0.387x0.387 : 0.2≈0.447x0.447
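The conversions between area fraction and hole side length used here are simple enough to check in a few lines. This is a sketch with hypothetical helper names; `total_area` assumes the holes neither overlap nor get clipped at the image border, so it is an upper bound on the masked fraction.

```python
import math


def hole_side(area_fraction):
    """Side length (as a fraction of the image side) of one square hole."""
    return math.sqrt(area_fraction)


def total_area(num_holes, side_fraction):
    """Upper bound on the masked area fraction for num_holes square holes."""
    return num_holes * side_fraction ** 2


for area in (0.10, 0.15, 0.20, 0.25):
    print(area, round(hole_side(area), 3))

# 10 holes of side 0.1 versus 1 hole of side 0.387
print(round(total_area(10, 0.1), 3), round(total_area(1, 0.387), 3))

# Hole area ratio between 0.1x320 and 0.1x256 pixel holes
print((0.1 * 320) ** 2 / (0.1 * 256) ** 2)
```

Seen this way, 10 holes of side 0.1 and a single hole of side 0.316 mask the same total area, so a comparison between them isolates the effect of the hole pattern rather than the masked area.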

Base : 1 cycle 10 epochs : num_hole=10, hole_size=0.1 x 0.1 : LB=0.880 (0.0235, 0.0287)

Run1 : 1 cycle 10 epochs : num_hole=1, hole_size=0.387 x 0.387 : LB=0.864 (0.0271, 0.0312)

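These two cannot be compared directly, but I think the 1.5-fold increase in total Cutout hole area pushed the model toward underfitting. Next, I expected that repeating the cycle under the latter condition would advance training, resolve the underfitting, and raise the score. Both train_loss and val_loss fell, so training did progress as expected and the LB score should have risen; instead it fell slightly.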

Run2 : 2 cycle 20 epochs : num_hole=1, hole_size=0.387 x 0.387 : LB=0.860 (0.0243, 0.0305)

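What could explain why Cutout did not help the LB score? I use two Cutout patterns: 1 hole and 10 holes. The original Cutout paper uses 1 hole, the usage I saw in the HuBMAP competition uses 10 holes, and I found no comment or note about the difference. So the two need to be compared. This had been on my mind, but I may have simply assumed that 1 hole is better. One possible reason for not getting the expected score, then, is that the Cutout pattern I chose was not appropriate for the task.

1 cycle, 15 epochs & hole_num=1, hole_area=0.25 (hole_size=0.5) : (0.0293, 0.0347) : LB=0.878

About this result I wrote that setting hole_area to 0.20 or 0.15 should push the LB score above 0.880. Let me check.

1 cycle, 15 epochs & hole_num=1, hole_area=0.20 (hole_size=0.447) : (0.0270, 0.0335) : LB=0.880

A slight improvement, but it did not exceed 0.880.

As I wrote above (and I think even earlier), I need to compare 1 hole with 10 holes. More precisely, I need to find a suitable combination of hole count and area. I seem to rely (often) on grounds and criteria that are not based on experimental results, so I should be careful.

I do not properly understand what is needed to improve predictive performance. I may be using the tools wrongly, or outside their range of applicability. There is much I do not understand. I do not even know what it is that I do not understand.

Effect of pixel count for images of the same field of view: EfficientNetB2-Unet, CyclicLR(SGD), lr=0.1-2.0, up 3 epochs, down 7 epochs, BCE, trfm_4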

512 pixel

1│ 0.1387│ 0.0823│ 3.16
2│ 0.0468│ 0.0521│ 3.15
3│ 0.0382│ 0.0438│ 3.14
4│ 0.0372│ 0.0404│ 3.11
5│ 0.0335│ 0.0374│ 3.11
6│ 0.0297│ 0.0319│ 3.12
7│ 0.0270│ 0.0352│ 3.12
8│ 0.0259│ 0.0317│ 3.12
9│ 0.0248│ 0.0291│ 3.14
10│ 0.0235│ 0.0287│ 3.12 : LB=0.880

256 pixel

1│ 0.1422│ 0.0888│ 1.14
2│ 0.0595│ 0.0557│ 1.13
3│ 0.0519│ 0.1444│ 1.13
4│ 0.0484│ 0.0510│ 1.12
5│ 0.0414│ 0.0405│ 1.12
6│ 0.0396│ 0.0499│ 1.12
7│ 0.0372│ 0.0422│ 1.12
8│ 0.0350│ 0.0343│ 1.12
9│ 0.0315│ 0.0334│ 1.12
10│ 0.0308│ 0.0299│ 1.11 : LB=0.869

The augmentation conditions have very likely changed a lot since I was using OneCycleLR, so I think I need to re-run OneCycleLR as well to recheck its performance.

April 8 (Thu)

HuBMAP: 1,405 teams, 33 days to go

256 pixel, batch_size=24, CyclicLR(SGD, 0.1-2.0, 3:7, 298:696), trfm_4, v50

1│ 0.1572│ 0.0902│ 1.10
2│ 0.0589│ 0.0579│ 1.08
3│ 0.0475│ 0.0444│ 1.08
4│ 0.0437│ 0.0397│ 1.08
5│ 0.0393│ 0.0433│ 1.08
6│ 0.0368│ 0.0418│ 1.08
7│ 0.0344│ 0.0319│ 1.08
8│ 0.0331│ 0.0322│ 1.08
9│ 0.0302│ 0.0309│ 1.08
10│ 0.0295│ 0.0311│ 1.08
11│ 0.0290│ 0.0339│ 1.08
12│ 0.0295│ 0.0361│ 1.08
13│ 0.0309│ 0.0383│ 1.07
14│ 0.0327│ 0.0366│ 1.07
15│ 0.0318│ 0.0392│ 1.07
16│ 0.0303│ 0.0320│ 1.08
17│ 0.0298│ 0.0370│ 1.08
18│ 0.0285│ 0.0318│ 1.08
19│ 0.0269│ 0.0344│ 1.08
20│ 0.0255│ 0.0333│ 1.08 : LB=0.867

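I had intended to examine batch_size=16-->24 with 3 cycles in 20 epochs, but I forgot to change the step sizes, so the run departed from the original purpose.

Apart from batch_size, the conditions are exactly the same as in the 256-pixel run shown above, with the 1-cycle, 10-epoch schedule repeated.

With 1 cycle of 10 epochs, raising the batch size from 16 to 24 seems to make training progress slightly more efficiently. I will check the effect on the LB score later.

EarlyStopping:

I feel as though I have wandered into a maze. Conventional wisdom was to decay the learning rate, and to watch train_loss and val_loss and stop training where two conditions both hold: the two have not diverged widely, and val_loss has not turned upward. OneCycleLR and CyclicLR behave in ways I had never seen, and I think that threw me off. I seem to have assumed that, with either one, good results come at the final epoch that was set in advance.

In that vein, when I saw train_loss reach 0.0202 at epoch 20 with 512 pixels, 3 cycles in 20 epochs, and a learning rate of 0.1-2.0, I assumed the score would go up as long as I could tune the augmentation appropriately from there.

Realizing this does not raise the score. I will keep at it steadily: trial and error, systematic experiments, learning from predecessors...

Things close or related to the idea of EarlyStopping include reducing the number of cycles, setting the final epoch slightly before the configured number of epochs, reducing the total number of epochs, using a smaller model (with my current model, a smaller encoder), and lowering the image resolution. In the end, I suppose it comes down to balancing these, searching for the combination that maximizes the LB score, and combining ensembling and TTA appropriately. Easier said than done...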
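The mismatch between classic early stopping and cyclic schedules can be illustrated with one of the logs in this diary. The sketch below applies a plain patience rule to the val_loss column of the 3-cycle, 20-epoch run that scored LB=0.869; `early_stop_epoch` is a hypothetical helper and patience=3 is an arbitrary choice, not a setting used in the actual training.

```python
def early_stop_epoch(val_losses, patience):
    """Epoch (1-based) at which a plain patience rule would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# val_loss per epoch from the 3-cycle, 20-epoch run (LB=0.869)
val_loss = [0.0543, 0.0593, 0.0572, 0.0766, 0.0416, 0.0375, 0.0326, 0.0378,
            0.0335, 0.0392, 0.0386, 0.0379, 0.0330, 0.0373, 0.0353, 0.0337,
            0.0978, 0.0370, 0.0316, 0.0314]

stop = early_stop_epoch(val_loss, patience=3)
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(stop, best_epoch)
```

With this rule, training would be cut off during the first rise of the learning rate, long before the lowest val_loss at the end of the last cycle, which is one reason a patience rule tuned for a decaying learning rate does not carry over directly.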

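A model with LB=0.925 is publicly available (inference as a notebook (code) on the competition site, training on GitHub), so if that score were all I wanted I could simply imitate it (though that is not as easy as it sounds). But I believe that what I need is to improve my score with my current approach. I am already borrowing some of the essence of that public code, so I cannot boast, but I still think there is a big difference between copying directly and proceeding by experiment.

Let me make an experiment plan for the six days from April 10 (Sat) to April 15 (Thu)!

When I close my eyes, I feel as if I can see my synapses firing. It is about time a good idea came to me; that none does presumably means I am not trying hard enough. Is it a lack of concentration? Do I lack the computing power and memory needed to think deeply? It feels as though whatever I memorize is erased right away.

Tomorrow I will make an experiment plan aiming for LB=0.880 ---> 0.900 by April 15. 0.900 - 0.880 = 0.001 x 20 : repeat an improvement of 0.001 twenty times and I reach 0.900!

* I first set up overfitting conditions and am now in the middle of searching for conditions that bring the model to a just-right fit by strengthening augmentation, but I also need to examine in parallel the combination of min_lr and max_lr, the number of epochs per cycle, the number of cycles, the combination of up_epoch and down_epoch, the encoder, the batch size, the pixel count, and so on.

Let me think about how to experiment effectively with limited time and resources.

Quantity and quality of training data : batch size : model : loss function : optimizer : learning-rate scheduler

I started with OneCycleLR and moved to CyclicLR, but with so many parameters (number of cycles, maximum learning rate, up:down ratio, and so on) the optimization did not go well. In the recent experiments with 512-pixel images, of the 11 runs using CyclicLR, 3 used 2 cycles and 8 used 1 cycle. The reason is that convergence was faster than with 256-pixel images, so there was no need to repeat cycles to reduce the loss. The second attempt, 1 cycle of 10 epochs, gave LB=0.880, and nothing since has exceeded it. The last GPU test applied the 1-cycle, 10-epoch schedule that gave 0.88 at 512 pixels to 256 pixels, which gave LB=0.869. If 1 cycle is enough, there is OneCycleLR. For the GPU time starting tomorrow, I will either concentrate on OneCycleLR or use it alongside single-cycle CyclicLR.

The effect of batch size could not be examined because of a configuration mistake, so I will raise its priority.

For Cutout, the comparison of 10 holes and 1 hole is insufficient, and the intermediate cases (3 or 5 holes, say) have not been examined at all.

Between 256 and 512 pixels (320, 384, 448 pixels, etc.) I only have results that include Cutout configuration mistakes, and I have not even followed up on the 320-pixel run that gave 0.877. I think these intermediate resolutions are likely to give good results, so I will work through them in order.

The encoder is fixed at EfficientNetB2 for now, but I will look for the best fit by trying B3 or B4 if the model underfits and B1 or B0 if it overfits.

As for the decoder, FPN looked promising around February, but when I checked at the end of March, FPN converged poorly; Linknet and MAnet seemed usable but did not appear to be better than Unet.

For the loss function, I started with BCE and seemed to be settling on Dice, but after BCE gave good results under a condition where DiceLoss converged slowly, I have continued with BCE alone. I would like to try combinations with other loss functions at some point. I will include that in this round of machine time.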

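Plan for April 10 (Sat):

First, evaluate OneCycleLR with 10 and 20 epochs and compare with the results obtained with CyclicLR.

Run1&2 : Check whether the 256-pixel score of LB=0.875 can be reproduced with OneCycleLR (Adam) and OneCycleLR (SGD). CyclicLR (SGD) used 3 cycles in 20 epochs, so I will use 20 epochs.

Run3&4 : Check whether the 256-pixel score of LB=0.869 can be reproduced with OneCycleLR (Adam) and OneCycleLR (SGD). CyclicLR (SGD) used 1 cycle of 10 epochs, so I will use 10 epochs.

Plan for April 11 (Sun):

Using 256-pixel images and the model and parameters that gave LB=0.875 with 3 cycles in 20 epochs, find conditions under which 1 cycle of n epochs gives a comparable LB score.

1 cycle of 10 epochs with CyclicLR gives LB=0.869, so I suspect that simply increasing the number of epochs will get close to 0.875. If the OneCycleLR examined on April 10 proves better, I will use OneCycleLR to search for the optimal number of epochs.

Plan for April 12 (Mon):

Compare the effect of Cutout with 1, 3, and 10 holes at equal total area. The area fraction is 0.10.

Plan for April 13 (Tue):

Examine the effect of batch size. With the current 16 as the base, examine 24 and 32.

Plan for April 14 (Wed):

Examine the effect of pixel count: 256, 320, 384, 448, and 512. Taking the conditions that give the best score at 256 pixels as the base, reduce the number of epochs by one for each step up in pixel count.

For example, 10 epochs gave 0.880 at 512 pixels, so taking this as the starting line, use 11 epochs at 448 pixels, 12 at 384 pixels, 13 at 320 pixels, and 14 at 256 pixels; that is, the lower the resolution, the more epochs.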
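The epoch plan by resolution can be written down as a tiny rule so that the settings for each run are generated rather than typed by hand. This is only a sketch of the plan as stated; `epochs_for` and its default arguments are illustrative names and values taken from the example above.

```python
def epochs_for(pixels, base_pixels=512, base_epochs=10, step=64):
    """One extra epoch for each 64-pixel step below the base resolution."""
    return base_epochs + (base_pixels - pixels) // step


plan = {px: epochs_for(px) for px in (512, 448, 384, 320, 256)}
print(plan)
```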

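However, I think the final score is more likely to improve by building up from 256 pixels than by taking 512 pixels as the starting premise.

Plan for April 15 (Thu):

Examine the loss function, the up:down ratio, and max_lr.

<Preliminary study using CPU>

Using the CPU and identical conditions, compare the training results of ResNet34 and EfficientNetB0.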

model = smp.Unet('resnet34', encoder_weights='imagenet', classes=1), batch_size=16, CyclicLR(SGD), lr=0.1-2.0, up=3 epochs, down=7 epochs

1│ 0.1392│ 0.1253│ 25.53
2│ 0.0850│ 0.5311│ 25.57
3│ 0.1121│ 0.0932│ 25.49
4│ 0.0781│ 0.0548│ 25.21
5│ 0.0607│ 0.0632│ 25.28
6│ 0.0568│ 0.0721│ 25.40
7│ 0.0501│ 0.0496│ 25.34
8│ 0.0479│ 0.0408│ 25.31
9│ 0.0426│ 0.0397│ 25.25
10│ 0.0405│ 0.0372│ 25.37

model = smp.Unet('efficientnet-b0', encoder_weights='imagenet', classes=1), CyclicLR(SGD), lr=0.1-2.0, batch_size=16, up=3 epochs, down=7 epochs

1│ 0.1533│ 0.1128│ 20.32
2│ 0.0657│ 0.3048│ 20.44
3│ 0.0649│ 0.0624│ 20.31
4│ 0.0587│ 0.0810│ 20.21
5│ 0.0490│ 0.0474│ 20.67
6│ 0.0444│ 0.0669│ 20.60
7│ 0.0430│ 0.0385│ 20.67
8│ 0.0381│ 0.0359│ 20.63
9│ 0.0358│ 0.0337│ 20.44
10│ 0.0333│ 0.0360│ 20.44

These two results show the difference between resnet34 and EfficientNetB0. I have not checked the predictions of these models, so I should perhaps not judge hastily, but both train_loss and val_loss are smaller with EfficientNetB0, so I think it is fair to regard it as converging faster.

Next, apply OneCycleLR(SGD, max_lr=2, epochs=10) to model = smp.Unet('efficientnet-b0', encoder_weights='imagenet', classes=1).

1│ 0.1547│ 0.0819│ 21.97
2│ 0.0648│ 0.6287│ 21.18
3│ 0.0905│ 0.1146│ 21.12
4│ 0.0600│ 0.0506│ 21.11
5│ 0.0493│ 0.0416│ 21.15
6│ 0.0458│ 0.0681│ 21.23
7│ 0.0418│ 0.0395│ 21.25
8│ 0.0388│ 0.0364│ 21.19
9│ 0.0355│ 0.0330│ 21.24
10│ 0.0334│ 0.0337│ 21.48

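One cycle of CyclicLR and OneCycleLR give results that look much the same. They use the same optimizer and max_lr; the difference is that the learning rate follows a triangular wave in CyclicLR but a cosine curve in OneCycleLR (a linear shape can also be selected). In addition, OneCycleLR lets you specify the initial and final learning rates separately. By default, the final lr is 4 orders of magnitude smaller than the initial lr (which by default is 1/25 of max_lr).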
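The default start, peak, and end values described here can be reproduced with a short stand-alone sketch. It is written from scratch with the standard library and follows the defaults as described above (initial lr = max_lr/25, final lr = initial lr/1e4, 30% rise, cosine shape); it is an illustration, not the PyTorch implementation, and the total of 1000 steps is an arbitrary assumption.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Cosine one-cycle schedule: initial_lr -> max_lr -> min_lr."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    peak_step = pct_start * total_steps - 1  # last step of the rising phase
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # pct = 0 gives start, pct = 1 gives end
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= peak_step:
        return cos_anneal(initial_lr, max_lr, step / peak_step)
    return cos_anneal(max_lr, min_lr,
                      (step - peak_step) / (last_step - peak_step))


total = 1000  # hypothetical number of batch steps in the whole run
lrs = [one_cycle_lr(s, total, max_lr=2.0) for s in range(total)]
print(lrs[0], max(lrs), lrs[-1])
```

For max_lr=2.0 this starts at 0.08, close to the 0.1 lower bound used with CyclicLR, but ends far lower than CyclicLR ever goes, which is one concrete difference between the two schedules beyond the shape of the curve.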

Once the GPU is available, I plan to submit the predictions and compare LB scores.

Finally, let me try OneCycleLR(AdamW, max_lr=1e-3, epochs=10), with AdamW as the optimizer, on model = smp.Unet('efficientnet-b0', encoder_weights='imagenet', classes=1). This is what I used for over a month before moving to CyclicLR.

1│ 0.4280│ 0.1768│ 24.61
2│ 0.1092│ 0.0692│ 24.71
3│ 0.0607│ 0.0490│ 25.10
4│ 0.0479│ 0.0434│ 25.04
5│ 0.0439│ 0.0369│ 24.94
6│ 0.0403│ 0.0374│ 24.98
7│ 0.0371│ 0.0331│ 25.30
8│ 0.0348│ 0.0341│ 25.21
9│ 0.0327│ 0.0351│ 25.25
10│ 0.0320│ 0.0341│ 25.23

id predicted
0 aa05346ff 15325 185 46045 185 76764 187 107482 189 13820...
1 2ec3f1bb9 60690327 20 60714307 41 60738287 56 60762271 6...
2 3589adb90 68511797 12 68541223 56 68570651 65 68600081 7...
3 d488c759a 62493304 1 62539963 4 62586623 5 62633282 6 62...
4 57512b7f1 217865569 2 217898801 14 217898952 3 217932039...

April 10 (Sat) to April 16 (Fri)

GPU : 40 h

Goal: LB=0.880 ---> LB=0.900

Means: starting from LB=0.875 at 256 pixels, optimize augmentation, batch_size, loss_fn, optimizer, learning_scheduler, pixel_size, encoder-decoder, etc.

April 10 (Sat)

GPU : 40 h

HuBMAP: 1,421 teams, 31 days to go : top LB score 0.943

Today, using OneCycleLR, I aim to beat my own 256-pixel record of LB=0.875.

With CyclicLR(SGD, 0.1-2.0), 1 cycle of 10 epochs gave LB=0.869 and 3 cycles in 20 epochs gave LB=0.875.

Now, in this situation, what should my first move of the day be? If I fed the data from my 248 submissions so far into a perfect machine learning model and trained it, what would it output? Probably "input data quality too poor; prediction impossible".

A prediction of the LB score on May 10 would be easy to get, though. Just plot the scores on graph paper.

February 22 (Mon): 0.836 (EfficientNetB5-Unet, scse, DiceLoss, OneCycleLR(Adam, max_lr=1e-3, 30 epochs))

March 20 (Sat): 0.850 (EfficientNetB2-Unet, BCELoss, OneCycleLR(Adam, max_lr=1e-3, 20 epochs), train_loss=0.0219, val_loss=0.0356): 8 images of the updated data used

Same day: 0.865, 10 images of the updated data used (the data update increased train_data from 8 to 15 images, but because of memory constraints only 10 are used, as now): (train_loss=0.0238, val_loss=0.0311)

March 24 (Wed): 0.875, Cutout

March 28 (Sun): 0.877, 320 pixel

April 5 (Mon): 0.880, 512 pixel

May 10 (Mon): 0.95???: a prediction born of wishful thinking

OneCycleLR:

Adam or SGD

max_lr: 1e-3 for Adam, 2.0 for SGD

up/(up+down): default is 0.3

Number of epochs, 10: for comparison with 1 cycle of 10 epochs of CyclicLR (v12), and as a baseline

Other: encoder-decoder, pixel_size, batch_size, loss_fn, augmentation

b2-Unet, 256 pixel, batch_size=16, BCELoss, Cutout 10, 10%

The losses for the 1-cycle, 10-epoch run with LB=0.869 were (0.0308, 0.0299), whereas OneCycleLR(SGD, max_lr=2.0, epochs=10) gave (0.0310, 0.0315).

Result: LB=0.866 (a change of -0.003). This is slightly below 0.869, but as a comparison with CyclicLR(SGD) it is good enough.

The next target is the LB=0.875 obtained with 3 cycles in 20 epochs. That will require more epochs, but where should I start?

Or should I first examine the effect of batch size at 10 epochs?

batch_size=16 : train_loss=0.0310, val_loss=0.0315 : LB=0.866

batch_size=24 : train_loss=0.0283, val_loss=0.0302 : LB=0.868

batch_size=32 : train_loss=0.0311, val_loss=0.0328 : not committed or submitted

Raising batch_size to 24 improved convergence and raised the LB score slightly, but raising it to 32 worsened convergence. Also considering that a larger batch_size uses more GPU memory and prevents handling images with more pixels, I will stay with batch_size 16.

Now, what about the number of epochs?

It seems I should run 15 and 20 epochs.

Beyond increasing the OneCycleLR(SGD) epochs, I may need to adjust max_lr, the up:down ratio, and so on.

Experiment with the following two cases; first of all, I need to exceed LB=0.869.

OneCycleLR(SGD, max_lr=2.0, epochs=15) : loss (0.0273, 0.0308) : LB=0.858

The predictions look different from before, which worries me. It came out as LB=0.858; the bad feeling was right.

The 10-epoch result was as follows.

OneCycleLR(SGD, max_lr=2.0, epochs=10, pct_start=0.3) : loss (0.0310, 0.0315) : LB=0.866

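By conventional standards, the train_loss and val_loss values are considerably better at 15 epochs. Nevertheless, the LB score is worse than at 10 epochs.

At this rate, trying 20 epochs next would probably not work either.

A possible reason 15 epochs does worse lies in its internal structure: with up at 30%, there are 4.5 epochs up and 10.5 epochs down. With 10 epochs, up is 3 epochs and down is 7. The essential difference between 10 and 15 epochs is the rate of change of the learning rate: with 10 epochs the maximum learning rate of 2.0 is reached in 3 epochs, while with 15 epochs it is reached in 4.5 epochs. This difference may strongly affect the Super-Convergence effect.

For now, with 15 epochs, I will change pct_start from 0.3 to 0.2 so that the learning rate peaks at epoch 3. That gives 3 epochs up and 12 epochs down.

I suspect that both OneCycleLR and CyclicLR rely on Super-Convergence to perform well, so once the conditions for it are not met, the results become much worse than with conventional learning-rate control.

That would mean that judging combinations of train_loss and val_loss by conventional wisdom could make me overlook good results and cling to bad ones.

This may account for a good part of the trap I had fallen into.

This bothered me a great deal, so I searched for "super convergence deep learning" and found a paper that (apparently) explains Super-Convergence theoretically.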

Super-Convergence with an Unstable Learning Rate
Samet Oymak∗, arXiv:2102.10734v1 [cs.LG] 22 Feb 2021
Abstract
Conventional wisdom dictates that learning rate should be in the stable regime so that gradient-based algorithms don’t blow up. This note introduces a simple scenario where an unstable learning rate scheme leads to a super fast convergence, with the convergence rate depending only logarithmically on the condition number of the problem. Our scheme uses a Cyclical Learning Rate where we periodically take one large unstable step and several small stable steps to compensate for the instability. These findings also help explain the empirical observations of [Smith and Topin, 2019] where they claim CLR with a large maximum learning rate leads to “super-convergence”. We prove that our scheme excels in the problems where Hessian exhibits a bimodal spectrum and the eigenvalues can be grouped into two clusters (small and large). The unstable step is the key to enabling fast convergence over the small eigen-spectrum.

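Knowing the principle does not give me the answer. The numbers to use in practice depend on the particular conditions and circumstances.

With 15 epochs, I changed pct_start from 0.3 to 0.2 so that the learning rate peaks at epoch 3 (3 epochs up, 12 epochs down).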

OneCycleLR(SGD, max_lr=2.0, epochs=15, pct_start=0.2) : loss (0.0272, 0.0320) : LB=0.865

Finally, I will try 12 epochs. With pct_start=0.25, the learning rate peaks at epoch 3.

OneCycleLR(SGD, max_lr=2.0, epochs=12, pct_start=0.25) : loss (0.0289, 0.0325) : LB=0.866

Let me line up the results. You can see that they are identical through epoch 3.

OneCycleLR(SGD, max_lr=2., epochs=10, pct_start=0.3) : loss (0.0310, 0.0315) : LB=0.866

1│ 0.1526│ 0.2327│ 1.12
2│ 0.0623│ 0.0656│ 1.10
3│ 0.0517│ 0.0459│ 1.11
4│ 0.0494│ 0.0451│ 1.10
5│ 0.0426│ 0.0481│ 1.10
6│ 0.0418│ 0.0406│ 1.11
7│ 0.0369│ 0.0373│ 1.11
8│ 0.0359│ 0.0387│ 1.10
9│ 0.0320│ 0.0339│ 1.11
10│ 0.0310│ 0.0315│ 1.10

OneCycleLR(SGD, max_lr=2., epochs=12, pct_start=0.25) : loss (0.0289, 0.0325) : LB=0.866

1│ 0.1526│ 0.2327│ 1.14
2│ 0.0623│ 0.0656│ 1.11
3│ 0.0517│ 0.0459│ 1.12
4│ 0.0494│ 0.0452│ 1.12
5│ 0.0430│ 0.0498│ 1.13
6│ 0.0417│ 0.0451│ 1.12
7│ 0.0374│ 0.0395│ 1.12
8│ 0.0373│ 0.0427│ 1.12
9│ 0.0334│ 0.0355│ 1.12
10│ 0.0320│ 0.0330│ 1.12
11│ 0.0299│ 0.0332│ 1.11
12│ 0.0289│ 0.0325│ 1.11

OneCycleLR(SGD, max_lr=2., epochs=15, pct_start=0.2) : loss (0.0272, 0.0320) : LB=0.865

1│ 0.1526│ 0.2327│ 1.13
2│ 0.0623│ 0.0656│ 1.11
3│ 0.0517│ 0.0459│ 1.11
4│ 0.0495│ 0.0452│ 1.11
5│ 0.0430│ 0.0549│ 1.10
6│ 0.0426│ 0.0649│ 1.11
7│ 0.0397│ 0.0424│ 1.12
8│ 0.0386│ 0.0389│ 1.12
9│ 0.0345│ 0.0354│ 1.11
10│ 0.0330│ 0.0335│ 1.11
11│ 0.0315│ 0.0330│ 1.12
12│ 0.0301│ 0.0373│ 1.11
13│ 0.0284│ 0.0322│ 1.11
14│ 0.0275│ 0.0352│ 1.11
15│ 0.0272│ 0.0320│ 1.10

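Tomorrow, keeping the condition that the learning rate peaks at epoch 3, I will push the maximum learning rate to the limit. With CyclicLR, when the maximum learning rate is too large, the loss spikes, comes back down, and then fails to converge; so I will set it close to the largest value at which that does not happen and see what happens to the LB score.

In the 3-cycle, 20-epoch run that gave LB=0.875 at 256 pixels, the maximum learning rate of 2.0 was reached in 2 epochs.

In the 1-cycle, 10-epoch run that gave LB=0.880 at 512 pixels, the maximum learning rate of 2.0 was reached in 3 epochs.

The conditions for Super-Convergence may also depend on the pixel count. They may depend on the EfficientNet B-number as well. FPN's failure to converge may also have been because the SC conditions were not met.

April 11 (Sun)

HuBMAP: 1,429 teams, 30 days to go

For 10, 12, and 15 epochs, the settings were up=3, down=7; up=3, down=9; and up=3, down=12, respectively. Shortening the rise to 2 epochs while keeping the fall the same as in the up=3 case gives up=2, down=7; up=2, down=9; and up=2, down=12. To do this, with slight rounding error, pct_start should be 0.222, 0.182, and 0.143, respectively. I will run the following three conditions.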
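The pct_start values follow directly from the number of up and down epochs, since pct_start is the fraction of the run spent rising. A small sketch of that arithmetic (`pct_start_for` is a hypothetical helper name):

```python
def pct_start_for(up_epochs, down_epochs):
    """Fraction of the run spent increasing the learning rate."""
    return up_epochs / (up_epochs + down_epochs)


for up, down in [(3, 7), (3, 9), (3, 12), (2, 7), (2, 9), (2, 12)]:
    total = up + down
    print(f"epochs={total:2d} up={up} down={down} "
          f"pct_start={pct_start_for(up, down):.3f}")
```

Because pct_start is rounded to three decimals, the peak lands within a few batch steps of the intended epoch boundary rather than exactly on it.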

OneCycleLR(SGD, max_lr=2., epochs=9, pct_start=0.222) : LB=0.861

OneCycleLR(SGD, max_lr=2., epochs=11, pct_start=0.182)

OneCycleLR(SGD, max_lr=2., epochs=14, pct_start=0.143)

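OneCycleLR has been running in its default cosine mode (anneal_strategy='cos'), whereas CyclicLR used the "triangular" mode, i.e., linear ramps (the counterpart of anneal_strategy='linear' in OneCycleLR).

OneCycleLR supports both modes, so let me compare them.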

OneCycleLR(SGD, max_lr=2., epochs=9, pct_start=0.222, anneal_strategy='linear') : LB=0.863

OneCycleLR(SGD, max_lr=3., epochs=9, pct_start=0.222, anneal_strategy='linear') :

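With max_lr raised to 3, the loss only came down to (0.0608, 0.0494), so I will increase the number of epochs. With CyclicLR I could lengthen the down phase or add another cycle to lower the loss further, but with OneCycleLR the only option is to lengthen the down phase.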

OneCycleLR(SGD, max_lr=3., epochs=14, pct_start=0.143, anneal_strategy='linear') :

The conditions should be the same through epoch 2, but at epoch 2 the loss jumped by several orders of magnitude and training stopped converging.

OneCycleLR(SGD, max_lr=3., epochs=14, pct_start=0.214, anneal_strategy='linear')

I set up to 3 epochs. Doing this sort of thing, I start to feel there is no point in using OneCycleLR any more. It gave LB=0.856. Back to cosine annealing.

OneCycleLR(SGD, max_lr=3., epochs=14, pct_start=0.214, anneal_strategy='cos')

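It gave LB=0.857. Still flying low.

For comparison, I will try the OneCycleLR default from a month ago, and then be done with OneCycleLR.

OneCycleLR(AdamW, max_lr=1e-3, epochs=14) : LB=0.867

I was just wandering between 0.85+ and 0.86+!

CyclicLR seems better than OneCycleLR. Shall I go back to CyclicLR, following the paper cited above?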

Super-Convergence with an Unstable Learning Rate
Samet Oymak∗, arXiv:2102.10734v1 [cs.LG] 22 Feb 2021

Our scheme uses a Cyclical Learning Rate where we periodically take one large unstable step and several small stable steps to compensate for the instability.

one large unstable step and several small stable steps

256 pixel, 1 cycle of 10 epochs, 0.1-2.0, up=3, down=7, trfm_4, B2-Unet

1│ 0.1422│ 0.0888│ 1.15
2│ 0.0595│ 0.0557│ 1.13
3│ 0.0519│ 0.1444│ 1.14
4│ 0.0484│ 0.0510│ 1.12
5│ 0.0414│ 0.0405│ 1.13
6│ 0.0396│ 0.0499│ 1.13
7│ 0.0372│ 0.0422│ 1.13
8│ 0.0350│ 0.0343│ 1.13
9│ 0.0315│ 0.0334│ 1.13
10│ 0.0308│ 0.0299│ 1.12 : LB=0.869

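With this data, I restart with CyclicLR. Let me exceed LB=0.875 at 256 pixels.

April 12 (Mon)

HuBMAP: 1,435 teams, 29 days to go

I have been using SGD with nesterov=True in CyclicLR; I tried setting it to False. The result was LB=0.867 (v1), not much different from the LB=0.869 with True. I will keep nesterov=True from here on.

Let me raise max_lr to 2.5. The result was LB=0.856 (v2). The schedule was unchanged, 3 epochs up and 7 down, but the loss, after falling and then rising again, peaked at epoch 4, fell only a little at epoch 5, and had still not fallen enough by the final epoch 10 (0.322, 0.314).

Next, let me lower max_lr to 1.5. The result was LB=0.865 (v3) (0.0302, 0.0309).

The peak appears to lie between max_lr=1.5 and 2.0.

Finding how far max_lr can be raised is more important, so let me set max_lr=3.0. val_loss exceeded 0.4 at epoch 3, but by epoch 10 the losses were (0.0315, 0.0302), giving LB=0.867 (v4). Since max_lr=2.5 gave LB=0.856 I expected this to fail, so the result was a surprise.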

1│ 0.1369│ 0.0986│ 0.5470│ 0.7288│ 1.14
2│ 0.0601│ 0.0554│ 0.7964│ 0.8399│ 1.12
3│ 0.0509│ 0.4107│ 0.8279│ 0.3687│ 1.13
4│ 0.0524│ 0.0696│ 0.8217│ 0.7586│ 1.12
5│ 0.0458│ 0.0572│ 0.8440│ 0.8088│ 1.12
6│ 0.0449│ 0.0463│ 0.8454│ 0.8219│ 1.12
7│ 0.0400│ 0.0525│ 0.8623│ 0.8356│ 1.11
8│ 0.0355│ 0.0347│ 0.8782│ 0.8721│ 1.12
9│ 0.0330│ 0.0346│ 0.8840│ 0.8820│ 1.12
10│ 0.0315│ 0.0302│ 0.8907│ 0.8912│ 1.12 : LB=0.867(v4)

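So I tried max_lr=5.0; val_loss exceeded 1.0, and both train_loss and val_loss stopped falling at around 0.28. This does not work.

I then tried max_lr=4.0. val_loss exceeded 0.5 at epoch 3 and 0.3 at epoch 5, and I wondered how it would turn out, but by epoch 10 the losses were (0.0316, 0.0311), giving LB=0.861 (v5).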

1│ 0.1368│ 0.0944│ 0.5481│ 0.7352│ 1.14
2│ 0.0648│ 0.1624│ 0.7842│ 0.6723│ 1.12
3│ 0.0546│ 0.5056│ 0.8184│ 0.5011│ 1.13
4│ 0.0689│ 0.0789│ 0.7807│ 0.7328│ 1.12
5│ 0.0472│ 0.3160│ 0.8415│ 0.2536│ 1.12
6│ 0.0480│ 0.0460│ 0.8402│ 0.8534│ 1.12
7│ 0.0388│ 0.0496│ 0.8685│ 0.8269│ 1.12
8│ 0.0361│ 0.0363│ 0.8769│ 0.8597│ 1.13
9│ 0.0337│ 0.0353│ 0.8818│ 0.8803│ 1.12
10│ 0.0316│ 0.0311│ 0.8897│ 0.8883│ 1.11 : LB=0.861(v5)

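That is all for today.

Next I will run these with 2 cycles. If 2 cycles lead to overfitting, I will reduce the number of down epochs: (up:3, down:7) x 2 --> (up:3, down:6) x 2 --> (up:3, down:5) x 2

Before that, I will keep 1 cycle and 3 epochs up, and lengthen the down phase, because at the final epoch 10 val_loss is smaller than train_loss. When I added one down epoch, however, val_loss became the larger of the two.

* No progress at all. Let me read the following paper.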

GRADIENT DESCENT ON NEURAL NETWORKS TYPICALLY OCCURS AT THE EDGE OF STABILITY
Jeremy Cohen Simran Kaur Yuanzhi Li J. Zico Kolter1 and Ameet Talwalkar2
Carnegie Mellon University and: 1Bosch AI 2 Determined AI
Correspondence to: jeremycohen@cmu.edu

arXiv:2103.00065v1 [cs.LG] 26 Feb 2021, Published as a conference paper at ICLR 2021

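April 13 (Tue)

HuBMAP: 1,445 teams, 28 days to go

I find the above paper hard and my understanding is slow, but I think it says that, when gradient descent seeks an optimum, there are solutions that generalize better in places different from those reached by conventional methods, and that reaching them requires a way of thinking different from conventional training.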

With max_lr=2, (up=3, down=7) x 2, the result was LB=0.875.

1│ 0.1422│ 0.0888│ 0.5264│ 0.7703│ 1.13
2│ 0.0595│ 0.0557│ 0.8006│ 0.8261│ 1.12
3│ 0.0519│ 0.1444│ 0.8270│ 0.5969│ 1.13
4│ 0.0484│ 0.0510│ 0.8348│ 0.8596│ 1.12
5│ 0.0414│ 0.0405│ 0.8568│ 0.8779│ 1.12
6│ 0.0396│ 0.0499│ 0.8632│ 0.8479│ 1.12
7│ 0.0372│ 0.0422│ 0.8725│ 0.8453│ 1.13
8│ 0.0350│ 0.0343│ 0.8806│ 0.8712│ 1.12
9│ 0.0315│ 0.0334│ 0.8888│ 0.8898│ 1.12
10│ 0.0308│ 0.0299│ 0.8922│ 0.8917│ 1.12 : LB=0.869
11│ 0.0300│ 0.0350│ 0.8935│ 0.8831│ 1.13
12│ 0.0327│ 0.0377│ 0.8852│ 0.8755│ 1.11
13│ 0.0339│ 0.0381│ 0.8823│ 0.8691│ 1.12
14│ 0.0338│ 0.0362│ 0.8835│ 0.8727│ 1.14
15│ 0.0326│ 0.0337│ 0.8858│ 0.8754│ 1.14
16│ 0.0309│ 0.0336│ 0.8920│ 0.8885│ 1.12
17│ 0.0308│ 0.0480│ 0.8921│ 0.8323│ 1.13
18│ 0.0304│ 0.0320│ 0.8926│ 0.8868│ 1.14
19│ 0.0279│ 0.0399│ 0.9009│ 0.8760│ 1.14
20│ 0.0261│ 0.0325│ 0.9079│ 0.8883│ 1.14 : LB=0.875

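In addition to 3 cycles / 20 epochs, 2 cycles / 20 epochs also gave LB=0.875.

Maybe I should try 1 cycle / 20 epochs as well.

I normally use max_lr=2.0, but to find the upper limit I tried 3.0, 4.0, and 5.0, with 3 up epochs. With 2 up epochs, the loss took abnormal values at max_lr=3 and training no longer converged. With 3 up epochs, every setting except max_lr=5 converged.

I have been using SGD with nesterov=True, but I thought turning it off might help suppress overfitting at higher pixel counts, so I started trying nesterov=False.

・With 1 cycle of 10 epochs, convergence seems to be about 1 to 2 epochs slower.

・At 384 pixels, max_lr=3 had produced abnormal loss values and failed to converge; with nesterov=False the loss behaved normally and converged.

Tomorrow I will look at the effect of image size (pixel count).

I will use the simplest CyclicLR setup: 1 cycle of 10 epochs.

max_lr will be 3 or 4.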

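April 14 (Wed)

HuBMAP: 1,453 teams, 27 days to go

Using the simplest CyclicLR setup (1 cycle, 10 epochs) with max_lr of 3 or 4, I ran train, inference, commit, and submit for image sizes of 320, 384, 448, and 512 pixels. The best result was LB=0.880 with 512-pixel images. So, once again, no new personal record.

I applied SGD with nesterov=False to the run where max_lr=3 had produced abnormal loss values and failed to converge. It was good in that the loss now converged smoothly, but the score did not improve.

April 15 (Thu)

HuBMAP: 1,454 teams, 26 days to go

Today, keeping the conditions that gave LB=0.869 with EfficientNetB2-Unet, I will swap B2 for B3, B4, B5, B6, and B7 and see what happens to the LB score. The number of epochs, the number of cycles, max_lr, the up/down ratio, and so on are far too many to explore exhaustively, and this may be nowhere near the optimal combination, but let's at least see what trend emerges.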

256 pixel :

B2-Unet : LB=0.869

B3-Unet : LB=0.858

B4-Unet : LB=0.873

B5-Unet : LB=0.860

B6-Unet : LB=0.875

B7-Unet : LB=0.868

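As LB scores go, this is disappointing. One mildly interesting point: in this order the values zigzag and no trend is visible, but if I split the models into an even-numbered group and an odd-numbered group, then within each group the larger model has the higher LB score.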

B2-Unet : LB=0.869

B4-Unet : LB=0.873

B6-Unet : LB=0.875

B3-Unet : LB=0.858

B5-Unet : LB=0.860

B7-Unet : LB=0.868

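I have posted several training logs, and although I talk about overfitting, train_dice tops out at about 0.91. At this rate, the val_dice of 0.92 reported by people with top-level LB scores is out of reach.

256 pixel : CyclicLR(SGD, up:down=3:7, lr=0.1-2.0), training results for 2 cycles / 20 epochs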

1│ 0.1489│ 0.0727│ 0.4805│ 0.7530│ 2.03
2│ 0.0564│ 0.0646│ 0.8073│ 0.8146│ 2.02
3│ 0.0567│ 0.0586│ 0.8260│ 0.8512│ 2.02
4│ 0.0455│ 0.0428│ 0.8513│ 0.8492│ 2.02
5│ 0.0386│ 0.0387│ 0.8714│ 0.8745│ 2.02
6│ 0.0359│ 0.0394│ 0.8789│ 0.8570│ 2.02
7│ 0.0333│ 0.0356│ 0.8855│ 0.8809│ 2.02
8│ 0.0309│ 0.0304│ 0.8932│ 0.9017│ 2.01
9│ 0.0284│ 0.0304│ 0.9003│ 0.8987│ 2.02
10│ 0.0272│ 0.0304│ 0.9039│ 0.8999│ 2.01 : LB=0.875
11│ 0.0267│ 0.0314│ 0.9051│ 0.8969│ 2.02
12│ 0.0292│ 0.0375│ 0.8988│ 0.8889│ 2.01
13│ 0.0304│ 0.0474│ 0.8982│ 0.8721│ 2.02
14│ 0.0316│ 0.0331│ 0.8939│ 0.8980│ 2.02
15│ 0.0285│ 0.0346│ 0.9011│ 0.8837│ 2.01
16│ 0.0264│ 0.0329│ 0.9074│ 0.8894│ 2.02
17│ 0.0258│ 0.0301│ 0.9091│ 0.9040│ 2.02
18│ 0.0256│ 0.0295│ 0.9104│ 0.9013│ 2.01
19│ 0.0239│ 0.0300│ 0.9146│ 0.9062│ 2.02
20│ 0.0245│ 0.0291│ 0.9149│ 0.9007│ 2.02 : LB=0.872

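April 16 (Fri)

HuBMAP: 1,459 teams, 25 days to go

I would like to break through the LB=0.880 wall with model ensembling and TTA, but before that there are still things I want to investigate with CyclicLR.

April 17 (Sat)

GPU : 36 h

HuBMAP: 1,461 teams, 24 days to go

I feel like I keep looking at the same papers, but each time from a different angle. Today I want to sort out the cases where repeating cycles improves results and the cases where it does not. What comes to mind is Figure 2(b) of the following paper: the best value is obtained in the first cycle, and repeating cycles does not appear to beat it.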

EXPLORING LOSS FUNCTION TOPOLOGY WITH CYCLICAL LEARNING RATES

L. N. Smith and N. Topin, arXiv:1702.04283v1 [cs.LG] 14 Feb 2017

f:id:AI_ML_DL:20210417091026p:plain

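What the results so far suggest

1. Under conditions where 1 cycle gives good results, repeating for 2 or 3 cycles does not beat the 1-cycle result.

2. Under conditions where 3 cycles give good results, 2 cycles or 1 cycle do not beat the 3-cycle result.

3. Increasing image resolution leads to a higher score.

4. Increasing the encoder model size leads to a higher score when low-resolution images are used.

Let's run 512 pixel, 1 cycle of 10 epochs, lr=0.1-2.0, with (up=4, down=6) and (up=down=5).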

up=5, down=5 : LB=0.868

up=4, down=6 : LB=0.872

up=3, down=7 : LB=0.880

up=2, down=8 : LB=0.872

Investigating the effect of Cutout.

num_holes=1, max_h_size=int(0.316 * 512) (0.316 * 0.316 ≈ 0.1)

1│ 0.1379│ 0.0758│ 0.5463│ 0.7905│ 3.02
2│ 0.0502│ 0.0560│ 0.8334│ 0.8330│ 2.99
3│ 0.0394│ 0.0866│ 0.8675│ 0.7177│ 3.01
4│ 0.0403│ 0.0383│ 0.8618│ 0.8809│ 3.00
5│ 0.0336│ 0.0388│ 0.8831│ 0.8766│ 3.01
6│ 0.0320│ 0.0446│ 0.8899│ 0.8725│ 3.00
7│ 0.0302│ 0.0448│ 0.8958│ 0.8498│ 3.02
8│ 0.0279│ 0.0311│ 0.9022│ 0.8840│ 3.01
9│ 0.0261│ 0.0292│ 0.9073│ 0.9017│ 3.00
10│ 0.0247│ 0.0271│ 0.9135│ 0.9030│ 3.02 : LB=0.865

For comparison: num_holes=10, max_h_size=int(0.1 * 512): LB=0.880
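The two Cutout settings compared above mask roughly the same total area, and the arithmetic can be checked with a short sketch. The helper below is hypothetical and assumes square holes at their maximum size that neither overlap nor get clipped by the image border, so it gives an upper bound on the masked fraction rather than the exact value.

```python
def cutout_area_fraction(num_holes, side_fraction):
    """Upper bound on the masked fraction of the image (square, non-overlapping holes)."""
    return num_holes * side_fraction ** 2


image_size = 512
for num_holes, side_fraction in [(1, 0.316), (10, 0.1)]:
    hole_px = int(side_fraction * image_size)      # value passed as max_h_size / max_w_size
    frac = cutout_area_fraction(num_holes, side_fraction)
    print(f"num_holes={num_holes}, hole side={hole_px}px, masked area <= {frac:.3f}")
```

Both settings come out at about 10% of the image, so the difference between LB=0.865 and LB=0.880 in this comparison reflects how the masked area is distributed (one large hole versus ten small ones), not how much is masked.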

April 18 (Sun)

HuBMAP: 1,466 teams, 23 days to go

The loss function had been fixed at BCE (BCEWithLogitsLoss) all along.

Today I will try Dice, Jaccard, Lovasz, and Focal.

B2-Unet, 512 pixel, CyclicLR(SGD, nesterov, lr=0.1-2.0, up=3, down=7, 10 epochs)

BCE : LB=0.880

Dice :

1│ 0.3138│ 0.2053│ 0.6862│ 0.7947│ 3.05
2│ 0.1600│ 0.8500│ 0.8400│ 0.1500│ 3.02
3│ 0.1658│ 0.3692│ 0.8342│ 0.6308│ 3.05
4│ 0.1341│ 0.1762│ 0.8659│ 0.8238│ 3.02
5│ 0.1147│ 0.1261│ 0.8853│ 0.8739│ 3.01
6│ 0.1147│ 0.1424│ 0.8853│ 0.8576│ 3.04
7│ 0.1035│ 0.1919│ 0.8965│ 0.8081│ 3.01
8│ 0.1214│ 0.1200│ 0.8786│ 0.8800│ 3.02
9│ 0.0917│ 0.0930│ 0.9083│ 0.9070│ 3.01
10│ 0.0851│ 0.1060│ 0.9149│ 0.8940│ 3.01 : LB=0.812

The LB score is very low. Even so, next I will increase down from 7 to 12.

Jaccard :

1│ 0.5231│ 0.9143│ 0.6076│ 0.1578│ 3.05
2│ 0.3025│ 0.7233│ 0.8190│ 0.4285│ 3.02
3│ 0.2833│ 0.8238│ 0.8330│ 0.2968│ 3.05
4│ 0.2603│ 0.3547│ 0.8485│ 0.7812│ 3.06
5│ 0.2442│ 0.3780│ 0.8583│ 0.7605│ 3.02
6│ 0.2029│ 0.3472│ 0.8861│ 0.7875│ 3.04
7│ 0.1871│ 0.2345│ 0.8960│ 0.8621│ 3.02
8│ 0.1695│ 0.1764│ 0.9068│ 0.9017│ 3.03
9│ 0.1618│ 0.1593│ 0.9111│ 0.9126│ 3.04
10│ 0.1510│ 0.1796│ 0.9179│ 0.8991│ 3.03 : LB=0.814

As with Dice, the score would probably have been better if I had stopped at epoch 9. Next I will increase down from 7 to 12.
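The Dice and Jaccard columns in these logs are closely related, which a small sketch makes explicit. The function below is an illustrative implementation on flat binary masks, not the metric code used in the notebook. For a single pair of masks the two scores satisfy J = D / (2 - D), so a Dice coefficient of 0.9179 corresponds to a Jaccard loss of about 0.152, close to the 0.1510 logged at epoch 10 (an exact match is not expected, since the logged values are averaged over batches).

```python
def dice_and_jaccard(pred, target):
    """Dice coefficient and Jaccard index (IoU) for two flat binary masks."""
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    union = total - inter
    dice = 2 * inter / total if total else 1.0       # two empty masks count as a perfect match
    jaccard = inter / union if union else 1.0
    return dice, jaccard


pred = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 0, 1, 1, 0]
dice, jaccard = dice_and_jaccard(pred, target)
print("dice =", dice, "jaccard =", jaccard)
print("J from D:", dice / (2 - dice))

# Convert the epoch-10 Dice coefficient from the Jaccard run into a Jaccard loss.
d = 0.9179
print("implied Jaccard loss:", 1 - d / (2 - d))
```

In the Dice run, the loss column is simply 1 minus the Dice column (for example 0.0851 and 0.9149), which is consistent with a plain soft Dice loss.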

Lovasz :

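Did not work. Cause unknown.

Focal :

It converges, but very slowly. Neither train_Dice_coeff nor val_Dice_coeff exceeded 0.72.

I had high hopes for both Lovasz and Focal, but you never know until you try.

Is it a matter of compatibility with SGD?

Dice_Loss and Jaccard_Loss both look like they could improve the score, so I would like to search for optimal conditions, but time and resources are limited. So I will use the up=3, down=12 setting that has given good results with BCE_loss, and also make Cutout stronger (num_holes=1, max_h_size=int(0.447 * 512), 0.447^2 ≈ 0.2).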

Dice_Loss :

1│ 0.3315│ 0.2272│ 0.6685│ 0.7728│ 3.04
2│ 0.1863│ 0.9131│ 0.8137│ 0.0869│ 3.03
3│ 0.1695│ 0.9015│ 0.8305│ 0.0985│ 3.00
4│ 0.1760│ 0.3990│ 0.8240│ 0.6010│ 3.01
5│ 0.1522│ 0.1426│ 0.8478│ 0.8574│ 3.03
6│ 0.1270│ 0.1614│ 0.8730│ 0.8386│ 2.99
7│ 0.1271│ 0.1413│ 0.8729│ 0.8587│ 3.01
8│ 0.1161│ 0.1543│ 0.8839│ 0.8457│ 3.00
9│ 0.1142│ 0.1338│ 0.8858│ 0.8662│ 2.99
10│ 0.1064│ 0.1156│ 0.8936│ 0.8844│ 3.02
11│ 0.1093│ 0.1155│ 0.8907│ 0.8845│ 3.00
12│ 0.1068│ 0.1251│ 0.8932│ 0.8749│ 3.03
13│ 0.0989│ 0.1062│ 0.9011│ 0.8938│ 3.03
14│ 0.0918│ 0.1090│ 0.9082│ 0.8910│ 3.00
15│ 0.0882│ 0.0841│ 0.9118│ 0.9159│ 3.05 : LB=0.842

I added 5 more down epochs and stronger Cutout, but it did not help.

JaccardLoss :

1│ 0.4901│ 0.9936│ 0.6367│ 0.0126│ 3.20
2│ 0.3292│ 0.9128│ 0.8001│ 0.1601│ 3.17
3│ 0.2946│ 0.9769│ 0.8245│ 0.0446│ 3.17
4│ 0.2760│ 0.7785│ 0.8374│ 0.3552│ 3.17
5│ 0.2781│ 0.7158│ 0.8355│ 0.4395│ 3.17
6│ 0.2369│ 0.4669│ 0.8641│ 0.6888│ 3.16
7│ 0.2201│ 0.2245│ 0.8749│ 0.8727│ 3.16
8│ 0.2094│ 0.1894│ 0.8821│ 0.8944│ 3.15
9│ 0.2038│ 0.2969│ 0.8852│ 0.8208│ 3.17
10│ 0.2022│ 0.2547│ 0.8866│ 0.8520│ 3.17
11│ 0.1978│ 0.1894│ 0.8893│ 0.8944│ 3.17
12│ 0.1827│ 0.1987│ 0.8986│ 0.8851│ 3.16
13│ 0.1775│ 0.1939│ 0.9019│ 0.8923│ 3.17
14│ 0.1749│ 0.1821│ 0.9035│ 0.8990│ 3.17
15│ 0.1663│ 0.1646│ 0.9087│ 0.9096│ 3.18 : LB=0.812

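Since train_loss > val_loss and train_Dice < val_Dice, I expected the score to improve, but it was no good at all. What is going on?

The other day, when I added the Dice coefficient calculation, its values did not track the LB score well, so I suspected the calculation was wrong and stopped using it. But it tracks train_loss and val_loss reasonably well, so I have started using it again.

This result is hard to accept. What is going on?

*DiceLoss seems to leave slightly more room for hope than JaccardLoss.

I tried CyclicLR with 1 cycle of 20 epochs (up=3, down=17) at 512 pixel, BCE, 1 hole, area=0.25. The result was LB=0.875 (v37). Since up=3, down=12 had given LB=0.878, it looks like nothing more can be expected in this direction.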

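April 19 (Mon)

HuBMAP: 1,470 teams, 22 days to go

I will try 512 pixel, BCE, 1 hole, area=0.25 with several of the OneCycleLR and CyclicLR setups used so far.

v38 : OneCycleLR(AdamW), 20 epochs : LB=0.870

v39 : OneCycleLR(SGD), 20 epochs : LB=0.862

v40 : CyclicLR(SGD, up=2, down=5, 0.1-2.0) : 3 cycles : LB=0.858

v41 : CyclicLR(SGD, up=3, down=7, 0.1-2.0) : 3 cycles : convergence was poor, so not committed.

Findings so far: I had been trying to raise the score by increasing the number of CyclicLR cycles, but it turned out that a single CyclicLR cycle can reach an equivalent score, and adding cycles never beat the single-cycle score.

What bothers me is that OneCycleLR, which is designed specifically for a single cycle, has not yet matched the score of a single CyclicLR cycle.

So tomorrow I will try to reach, and exceed, LB=0.880 with OneCycleLR.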

April 20 (Tue)

HuBMAP: 1,469 teams, 21 days to go

512 pixel, BCE, 1 hole, area=0.25 ---> 10 holes, area=0.1

OneCycleLR(AdamW, maxlr=1e-3, epochs=10) : LB=0.872

OneCycleLR(SGD, maxlr=2.0, epochs=10) : LB=0.849

I increased the probability of Blur and GaussNoise and ran the same calculations.

OneCycleLR(AdamW), Blur 0.25 -> 0.5, GaussNoise 0.25 -> 0.5 : LB=0.881

OneCycleLR(SGD), Blur 0.25 -> 0.5, GaussNoise 0.25 -> 0.5 : LB=0.862

At last, a new personal best after two weeks. It is an improvement of only 0.001, but I am glad that slightly heavier augmentation (more Blur, GaussNoise, and so on) raised the score. Applying this to the CyclicLR conditions that gave 0.880 may raise the score a little further.

With this, I have used up almost all of my GPU quota.

On Wednesday, Thursday, and Friday I will organize the data so far and plan the experiments for the cycle that starts on Saturday.

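April 21 (Wed)

HuBMAP: 1,474 teams, 20 days to go

The Kaggle GPU time I can still use for this competition is about 90 hours plus a little. If one run takes 90 minutes, then with the commit it takes 3 hours to reach a submit. That leaves about 30 more attempts.

It took 300 tries to go from 0.83 to 0.88, so at this pace I would get to about 0.885.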
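The budget estimate above is simple enough to spell out as a sketch. The variable names are made up for illustration, and the projection assumes that the score keeps improving linearly per attempt, which is the diary's own back-of-the-envelope assumption and almost certainly optimistic near the top of the leaderboard.

```python
gpu_hours_left = 90          # remaining Kaggle GPU quota, ignoring the "plus a little"
hours_per_attempt = 3.0      # 90 min of training, roughly doubled by the commit run

attempts_left = int(gpu_hours_left // hours_per_attempt)

# Linear extrapolation: 300 attempts moved the LB score from 0.83 to 0.88.
gain_per_attempt = (0.88 - 0.83) / 300
projected_lb = 0.88 + gain_per_attempt * attempts_left

print("attempts left:", attempts_left)
print(f"projected LB: {projected_lb:.3f}")
```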

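The Riiid! competition ended on January 7, and I started on HuBMAP after that (and a little before), so I have been working on it for more than three months. It ends in another 20 days.

qubvel/segmentation_models.pytorch

Using this segmentation_models package, I was able to try various Encoder/Decoder combinations and several loss_function combinations.

https://www.kaggle.com/leighplt/pytorch-fcn-resnet50

I used this code and its faster version, added segmentation_models to it, and added several augmentation sets, optimizers, learning rate schedulers, and so on. I also built in TTA, but using it every time costs too much time, so I only used it experimentally. I have not used KFold or ensembling yet, and I may not use them at all.

Until now, I had never examined the effect of augmentation while actually looking at the images. I only changed parameters and watched how the score moved.

Over the remaining 20 days, I want to understand concretely how augmentation relates to the score, checking the images as I go.

Blur and GaussNoise, whose probability I changed from 0.25 to 0.5 yesterday:

For GaussNoise, the parameter values are very small, and I suspect it was not doing much. At least, I could not see any change by eye. Even after making the parameter 10 times larger, I could not see a change.

Added April 24: the GaussNoise parameters are the variance and the mean; the defaults are var_limit=(10.0, 50.0), mean=0.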
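A quick calculation suggests why the noise was hard to see. The sketch below assumes that var_limit is a variance expressed in 8-bit pixel units (0 to 255), so the noise standard deviation is the square root of the sampled variance; this is my reading of the parameter, not something confirmed in the text above.

```python
import math

var_limit = (10.0, 50.0)                       # default variance range quoted above
sigma_lo, sigma_hi = (math.sqrt(v) for v in var_limit)

print(f"sigma range: {sigma_lo:.2f} to {sigma_hi:.2f} grey levels")
print(f"relative to the 0-255 range: {sigma_lo / 255:.1%} to {sigma_hi / 255:.1%}")

# Multiplying the variance by 10 only multiplies sigma by sqrt(10), about 3.16.
sigma_hi_x10 = math.sqrt(10 * var_limit[1])
print(f"sigma with 10x variance: {sigma_hi_x10:.2f} grey levels")
```

Under that assumption the default noise perturbs each pixel by only a few grey levels, and even a tenfold increase in variance leaves it under 10% of the dynamic range, which would fit the observation that no change was visible by eye.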

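Up to today, I changed parameters mechanically based only on results, without checking the images. I do not think I have learned anything that way.

Images are normalized and otherwise processed before they enter the model, so what the model computes on is quite different from what I see. I therefore do not know how much visually checking augmentation on the original images helps in understanding its effect, but I will try it anyway.

Blur is set to blur_limit=3. With this setting the image resolution looks reduced to roughly 1/2 to 1/3. Since this was effective, I plan to run (3, 5) or (5, 5) once I can use the GPU again. Going higher would give something like a 128- or 64-pixel-level image. Even so, I cannot say that would be meaningless; it may be worth trying once.

Tomorrow I plan to check by eye how much Brightness, Contrast, Gamma, and so on change the train images.

April 22 (Thu)

HuBMAP: 1,478 teams, 19 days to go

RandomGamma: gamma_limit=(80, 120): this is the default, but I do not know how 80 and 120 are defined.

Wikipedia has the following description.

When input and output values are linearly related, the gamma value is 1 (γ=1); for γ<1 the output tones are brighter, and for γ>1 they are darker. For example, if an image that displays correctly on a display with gamma 2.2 is shown on a display with gamma 1.8, the effective gamma is γ=1.8/2.2≈0.82 and the image appears brighter than intended. Correcting the tones of an image along a suitable gamma curve so that the overall gamma from input to final output is 1 is called gamma correction.

A Qiita article shows example images for different parameter values, but it did not explain what the numbers mean. gamma_limit=(50, 50) is brighter than the original image, and gamma_limit=(150, 150) is darker.

Judging from the default values and the example images, I guess that 100 is normal, lower values brighten, higher values darken, and the two numbers specify the lower and upper bounds. Switching Wikipedia to English, the word luminance appears and it says values are normalized to 1 or 100, so this interpretation seems fine.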
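The guess above can be checked numerically with a small sketch. It assumes that the gamma_limit value is divided by 100 to obtain γ and that the transform is out = in^γ on intensities scaled to [0, 1]; `apply_gamma` is a hypothetical helper written for this illustration, not the library function.

```python
def apply_gamma(pixel, gamma_param):
    """Gamma-adjust one 8-bit value; gamma_param uses the 100 = identity scale."""
    gamma = gamma_param / 100.0
    return round(255 * (pixel / 255) ** gamma)


# Dark, mid, and bright input values under the settings mentioned in the text.
for gamma_param in (50, 80, 100, 120, 150):
    outputs = [apply_gamma(p, gamma_param) for p in (32, 128, 224)]
    print(f"gamma_limit value {gamma_param}: 32, 128, 224 -> {outputs}")
```

With this reading, values below 100 lift the mid-tones (brighter) and values above 100 lower them (darker), which matches both the Wikipedia description and the Qiita examples at 50 and 150.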

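RandomBrightness and RandomContrast: the default for both is limit=(-0.2, 0.2). The negative side is darker or lower contrast (which ends up looking darker), and the positive side is the opposite.

How should I use these?

I have tried several Cutout settings too, but I am still in the middle of exploring.

For gamma, brightness, contrast, and the rest, there is no choice but to try them one at a time.

I am starting to feel that this is terribly primitive work. Where is the artificial intelligence in this? Doing everything by hand, one item at a time, seems wrong.

In F. Chollet's textbook, the dogs-vs-cats classification example uses 500 images per class, where overfitting caps accuracy at about 0.7, and then increasing the amount of data with augmentation raises it to about 0.85. It is meant to let you experience the effect of augmentation first-hand.

The augmentation I am using now does not seem to change the number of images; it just modifies some of them. This has been bothering me for a while.

I remember someone in a Kaggle competition Discussion asking why Keras seemed to give higher scores than PyTorch. Could this have been the reason?

Because F. Chollet's textbook uses Keras, I had assumed that augmentation increases the number of images.

However, in the PyTorch code I am using now, the amount of train_data does not increase with augmentation. Images are simply transformed with the specified probability. For example, if the only transform is A and its probability is 0.25, the result is not the original images plus an extra 25% of A-transformed images. It is 75% original images and 25% A-transformed images, totaling 100%, so the number of images does not grow.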
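The two behaviours contrasted above can be illustrated with a toy sketch. Both functions are hypothetical and operate on strings standing in for images; neither is the pipeline actually used here. The first replaces an image with its transformed version with probability p, the second keeps every original and appends transformed copies.

```python
import random


def augment_with_probability(images, transform, p, rng):
    """Each image is replaced by its transformed version with probability p."""
    return [transform(img) if rng.random() < p else img for img in images]


def augment_by_appending(images, transform, p, rng):
    """All originals are kept; transformed copies of a random subset are appended."""
    extra = [transform(img) for img in images if rng.random() < p]
    return images + extra


rng = random.Random(0)
images = [f"img{i}" for i in range(1000)]
mark_a = lambda img: img + "+A"                 # stand-in for transform A

in_place = augment_with_probability(images, mark_a, 0.25, rng)
appended = augment_by_appending(images, mark_a, 0.25, rng)

n_transformed = sum(img.endswith("+A") for img in in_place)
print("in place:", len(in_place), "images,", n_transformed, "transformed")
print("appended:", len(appended), "images")
```

One caveat: when the in-place version is applied on the fly, a fresh random draw is made every epoch, so over many epochs the model still sees many distinct variants of each image even though the per-epoch count stays fixed. When the transform is applied once before training, as described above, that variety is lost.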

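Previously, when I used augmentation in PyTorch, preprocessing (including augmentation) ran on the fly while data were fed to the model, so I could not tell whether augmentation increased the number of images. This time, preprocessing including augmentation is finished before the data are fed to the model, and the number of processed images is displayed, so I am fairly sure augmentation is not increasing the image count.

I think this is not a matter of Keras versus PyTorch but of the intent of whoever wrote the program: a choice between original data plus transformed_data, and original data transformed with a given probability. Limits on programming skill and compute can also force one choice or the other. I believe I understand the current situation and am preparing my next steps accordingly, but right now my programming skill is the bottleneck.

Tomorrow I will search for effective augmentation methods. There are heaps of techniques I have not used.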

AutoML: A Survey of the State-of-the-Art
Xin He, Kaiyong Zhao, Xiaowen Chu, arXiv:1908.00709v6 [cs.LG] 16 Apr 2021

Abstract
Deep learning (DL) techniques have obtained remarkable achievements on various tasks, such as image recognition, object detection, and language modeling. However, building a high-quality DL system for a specific task highly relies on human expertise, hindering its wide application. Meanwhile, automated machine learning (AutoML) is a promising solution for building a DL system without human assistance and is being extensively studied. This paper presents a comprehensive and up-to-date review of the state-of-the-art (SOTA) in AutoML. According to the DL pipeline, we introduce AutoML methods –– covering data preparation, feature engineering, hyperparameter optimization, and neural architecture search (NAS) –– with a particular focus on NAS, as it is currently a hot sub-topic of AutoML. We summarize the representative NAS algorithms’ performance on the CIFAR-10 and ImageNet datasets and further discuss the following subjects of NAS methods: one/two-stage NAS, one-shot NAS, joint hyperparameter and architecture optimization, and resource-aware NAS. Finally, we discuss some open problems related to the existing AutoML methods for future research.

Keywords: deep learning, automated machine learning (AutoML), neural architecture search (NAS), hyperparameter optimization (HPO)

概要
ディープラーニング（DL）技術は、画像認識、オブジェクト検出、言語モデリングなどのさまざまなタスクで目覚ましい成果を上げています。ただし、特定のタスクのために高品質のDLシステムを構築することは、人間の専門知識に大きく依存しており、その幅広いアプリケーションを妨げています。一方、自動機械学習（AutoML）は、人間の支援なしでDLシステムを構築するための有望なソリューションであり、広く研究されています。このホワイトペーパーでは、AutoMLの最先端（SOTA）の包括的で最新のレビューを紹介します。 DLパイプラインによると、現在AutoMLのホットなサブトピックであるため、NASに特に焦点を当てて、データ準備、機能エンジニアリング、ハイパーパラメータ最適化、ニューラルアーキテクチャ検索（NAS）をカバーするAutoMLメソッドを紹介します。 CIFAR-10およびImageNetデータセットでの代表的なNAS アルゴリズムのパフォーマンスを要約し、NASメソッドの次の主題についてさらに説明します：1/2ステージNAS、ワンショットNAS、共同ハイパーパラメータとアーキテクチャの最適化、およびリソース認識NAS。最後に、将来の研究のために、既存のAutoMLメソッドに関連するいくつかの未解決の問題について説明します。　by Google翻訳

f:id:AI_ML_DL:20210423004754p:plain

f:id:AI_ML_DL:20210423004905p:plain

f:id:AI_ML_DL:20210423005008p:plain

2.3. Data Augmentation
To some degree, data augmentation (DA) can also be regarded as a tool for data collection, as it can generate new data based on the existing data. However, DA also serves as a regularizer to avoid over-fitting of model training and has received more and more attention. Therefore, we introduce DA as a separate part of data preparation in detail. Figure 3 classifies DA techniques from the perspective of data type (image, audio, and text), and incorporates automatic DA techniques that have recently received much attention.
For image data, the affine transformations include rotation, scaling, random cropping, and reflection; the elastic transformations contain the operations like contrast shift,
brightness shift, blurring, and channel shuffle; the advanced transformations involve random erasing, image blending, cutout [89], and mixup [90], etc. These three types of
common transformations are available in some open source libraries, like torchvision 5, ImageAug [91], and Albumentations [92]. In terms of neural-based transformations, it can be divided into three categories: adversarial noise [93], neural style transfer [94], and GAN technique [95].

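April 23 (Fri)

HuBMAP: 1,492 teams, 18 days to go

Last night I was idly looking through a paper I had read a little while ago, and one point caught my attention: the initial phase of training has a large influence on the final result. I had learned that, when using a model pretrained on ImageNet or similar, starting from a small learning rate preserves the pretraining and gives good results. However, the gist of this paper seems to be that, when training with SGD, using a learning rate that is not small in the early phase of training gives good results.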

THE BREAK-EVEN POINT ON OPTIMIZATION TRAJECTORIES OF DEEP NEURAL NETWORKS
Stanisław Jastrzębski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor
Kyunghyun Cho, Krzysztof Geras, Published as a conference paper at ICLR 2020

ABSTRACT
The early phase of training of deep neural networks is critical for their final performance.
In this work, we study how the hyperparameters of stochastic gradient descent (SGD) used in the early phase of training affect the rest of the optimization trajectory. We argue for the existence of the “break-even" point on this trajectory, beyond which the curvature of the loss surface and noise in the gradient are implicitly regularized by SGD. In particular, we demonstrate on multiple classification tasks that using a large learning rate in the initial phase of training reduces the variance of the gradient, and improves the conditioning of the covariance of gradients. These effects are beneficial from the optimization perspective and become visible after the break-even point. Complementing prior work, we also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers. In short, our work shows that key properties of the loss surface are strongly influenced by SGD in the early
phase of training. We argue that studying the impact of the identified effects on
generalization is a promising future direction.

概要
ディープニューラルネットワークのトレーニングの初期段階は、最終的なパフォーマンスにとって重要です。この作業では、トレーニングの初期段階で使用される確率的勾配降下法（SGD）のハイパーパラメーターが残りの最適化軌道にどのように影響するかを調べます。この軌道上に「損益分岐点」が存在することを主張します。それを超えると、損失面の曲率と勾配のノイズがSGDによって暗黙的に正則化されます。特に、大規模な学習を使用することを複数の分類タスクで示します。トレーニングの初期段階でのレートは、勾配の分散を減らし、勾配の共分散の調整を改善します。これらの効果は、最適化の観点から有益であり、損益分岐点の後に見えるようになります。以前の作業を補完して、次のことも示します。低い学習率を使用すると、バッチ正則化層を備えたニューラルネットワークでも損失面のコンディショニングが悪くなります。要するに、私たちの研究は、損失面の主要な特性がトレーニングの初期段階でSGDの影響を強く受けることを示しています。正則化に対する特定された影響の影響を研究することは、有望な将来の方向性です。by Google翻訳

Let's look at some experimental results that may relate to what this paper describes.

I ran CyclicLR with base_lr set to 0.1, 0.001, and 0.00001.

v29 : 2 cycles, 20 epochs : 0.1-2.0 : LB=0.839 : (train_loss=0.0258, val_loss=0.0301)

v31 : 2 cycles, 20 epochs : 0.001-2.0 : LB=0.853 : (train_loss=0.0256, val_loss=0.0311)

v32 : 2 cycles, 20 epochs : 0.00001-2.0 : LB=0.847 : (train_loss=0.0255, val_loss=0.0339)

From these LB scores alone, one would conclude that a base_lr of 0.001 or 0.00001 is better than 0.1, but I judged 0.1 to be better from another angle: train_loss and val_loss at the final epoch. As base_lr gets smaller, train_loss hardly changes while val_loss clearly grows, so I judged the smaller values to be worse. Looking back now, settling on 0.1 from this result alone was hasty. Going by the paper above, an order-of-magnitude difference in the minimum lr matters a lot, and base_lr=0.01 should be checked as well. So I will compare base_lr=0.1, 0.01, and 0.001 at 512 pixels.

Plan for April 24 (Sat) to April 30 (Fri)

・Compare base_lr=0.1, 0.01, 0.001

・Check the effects of GaussNoise, Blur, (gamma, contrast, brightness)

・For Cutout, I had been fixated on the setting shown in the original paper (1 hole, area fraction 0.25). I will now use 10 holes and vary the area fraction over 0.1, 0.15, and 0.2.

April 24 (Sat)

HuBMAP: 1,503 teams, 17 days to go

GPU: 38 h

Today I will do the comparison of base_lr=0.1, 0.01, 0.001.

The baseline is LB=0.880 (v2/v25): 512 pixel, CyclicLR(SGD, 0.1-2.0, up=3 epochs, down=7 epochs), BCELoss, with augmentation consisting of flip and rotate plus cutout(hole_num=1, total_area=0.1), Blur(limit=(3,3), p=0.5), GaussNoise at defaults with p=0.5, and brightness, contrast, gamma, etc. at defaults with p=0.5.

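import albumentations as A  # 2021-era API: RandomBrightness, RandomContrast, and Cutout were deprecated in later releases

trfm_4 = A.Compose([
    A.Resize(512, 512),
    # Photometric jitter: one of the three is applied, half of the time
    A.OneOf([
        A.RandomBrightness(limit=(-0.2, 0.2), p=1),
        A.RandomContrast(limit=(-0.2, 0.2), p=1),
        A.RandomGamma(gamma_limit=(80, 120), p=1),
    ], p=0.5),
    # Mild smoothing with a fixed kernel size of 3
    A.OneOf([
        A.Blur(blur_limit=(3, 3), p=1),
        A.MedianBlur(blur_limit=(3, 3), p=1),
    ], p=0.5),
    A.GaussNoise(var_limit=(10.0, 50.0), mean=0, p=0.5),
    A.RandomRotate90(p=.5),
    A.HorizontalFlip(p=.5),
    A.VerticalFlip(p=.5),
    # 10 holes, each at most 10% of the image side in height and width
    A.Cutout(num_holes=10, max_h_size=int(.1 * 512), max_w_size=int(.1 * 512), p=.25),
])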

Results:

1│ 0.1406│ 0.1000│ 0.5418│ 0.7483│ 3.98
2│ 0.0462│ 0.0695│ 0.8463│ 0.7611│ 3.96
3│ 0.0389│ 0.0504│ 0.8689│ 0.8558│ 3.97
4│ 0.0357│ 0.0364│ 0.8777│ 0.8743│ 3.93
5│ 0.0336│ 0.0363│ 0.8822│ 0.8813│ 3.96
6│ 0.0318│ 0.0323│ 0.8903│ 0.8976│ 3.94
7│ 0.0280│ 0.0321│ 0.9020│ 0.8903│ 3.96
8│ 0.0267│ 0.0353│ 0.9055│ 0.8749│ 3.94
9│ 0.0256│ 0.0271│ 0.9086│ 0.9073│ 3.98
10│ 0.0241│ 0.0322│ 0.9152│ 0.8960│ 3.96 : LB=0.883

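A new personal best. It went up by only 0.002.

I think it is overfitting, so it seems I could raise the augmentation strength a bit more, but...

This was not planned, but I will try combining loss functions. I am doing this experiment prepared for it to take time and show no benefit.

First, I will combine BCE, Dice, and Jaccard with equal weights.

I follow the LB=0.883 conditions above, but at 512 pixels the GPU memory is at its limit (15.8 GB), so I use 448 pixels. The result is below; the LB score is very low.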

1│ 0.3207│ 0.2839│ 0.6660│ 0.7603│ 3.17
2│ 0.1688│ 0.4044│ 0.8523│ 0.6247│ 3.15
3│ 0.1431│ 0.1717│ 0.8758│ 0.8555│ 3.14
4│ 0.2095│ 0.2449│ 0.8339│ 0.7640│ 3.11
5│ 0.1360│ 0.3452│ 0.8833│ 0.7496│ 3.12
6│ 0.1273│ 0.1305│ 0.8908│ 0.8872│ 3.14
7│ 0.1062│ 0.1581│ 0.9089│ 0.8672│ 3.13
8│ 0.1032│ 0.1118│ 0.9114│ 0.9015│ 3.13
9│ 0.0983│ 0.0969│ 0.9158│ 0.9168│ 3.14
10│ 0.0901│ 0.0890│ 0.9225│ 0.9232│ 3.12 : LB=0.840

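With train_dice_coeff=0.9225 and val_dice_coeff=0.9232, I was hoping for a new record, but the result was very disappointing. What is happening? The same thing occurred with DiceLoss and JaccardLoss on their own: val_dice_coeff exceeded 0.90 in each case, yet the LB scores were 0.842 and 0.812.

I could not give up, so I increased the number of down epochs from 7 to 11.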

1│ 0.3207│ 0.2839│ 0.6660│ 0.7603│ 3.19
2│ 0.1688│ 0.4044│ 0.8523│ 0.6247│ 3.16
3│ 0.1431│ 0.1717│ 0.8758│ 0.8555│ 3.15
4│ 0.1337│ 0.1880│ 0.8842│ 0.8265│ 3.14
5│ 0.1258│ 0.1937│ 0.8908│ 0.8218│ 3.13
6│ 0.1209│ 0.1141│ 0.8952│ 0.9038│ 3.16
7│ 0.1083│ 0.1476│ 0.9061│ 0.8691│ 3.14
8│ 0.0986│ 0.1044│ 0.9151│ 0.9085│ 3.14
9│ 0.0992│ 0.1300│ 0.9141│ 0.8923│ 3.16
10│ 0.0963│ 0.0974│ 0.9170│ 0.9158│ 3.15
11│ 0.0925│ 0.0941│ 0.9201│ 0.9201│ 3.14
12│ 0.0881│ 0.1002│ 0.9237│ 0.9146│ 3.16
13│ 0.0825│ 0.0915│ 0.9289│ 0.9214│ 3.15
14│ 0.0799│ 0.0912│ 0.9311│ 0.9222│ 3.14 : LB=0.855

Tomorrow I will try BCE : Dice : Jaccard = 8 : 1 : 1.
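A weighted combination of the three losses can be sketched in plain Python as follows. This is an illustrative version operating on flat lists of predicted probabilities; the function names are hypothetical, and normalizing by the sum of the weights is an assumption (the notebook may simply add the terms). In the actual PyTorch pipeline the same idea would be applied to tensors.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy on probabilities (clamped for numerical safety)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


def soft_dice_loss(probs, targets, eps=1e-7):
    inter = sum(p * t for p, t in zip(probs, targets))
    return 1 - (2 * inter + eps) / (sum(probs) + sum(targets) + eps)


def soft_jaccard_loss(probs, targets, eps=1e-7):
    inter = sum(p * t for p, t in zip(probs, targets))
    union = sum(probs) + sum(targets) - inter
    return 1 - (inter + eps) / (union + eps)


def combined_loss(probs, targets, weights=(1, 1, 1)):
    """Weighted average of BCE, soft Dice, and soft Jaccard losses."""
    parts = (bce_loss(probs, targets),
             soft_dice_loss(probs, targets),
             soft_jaccard_loss(probs, targets))
    return sum(w * part for w, part in zip(weights, parts)) / sum(weights)


probs = [0.9, 0.8, 0.3, 0.1]
targets = [1, 1, 0, 0]
equal = combined_loss(probs, targets, weights=(1, 1, 1))       # equal parts
bce_heavy = combined_loss(probs, targets, weights=(8, 1, 1))   # BCE : Dice : Jaccard = 8 : 1 : 1
print(f"equal weights: {equal:.4f}, 8:1:1 weights: {bce_heavy:.4f}")
```

With 8:1:1 weighting the gradient is dominated by the BCE term, so the run should behave much like the BCE baseline with a small overlap-based correction.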

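April 25 (Sun)

HuBMAP: 1,517 teams, 16 days to go

There are 15 train_data images. I am currently using the original 8 plus 2. If the unused train_data and the public_test_data behind the public LB differ substantially, a large gap between val_dice_coeff and the LB score would not be surprising.

To check this, I will change which 2 additional train_data images I use.

Listed below are the train_data images added when the data were updated in early February. The 5 marked pub_test were the public_test_data used to compute the public LB score before the update. The 2 additional images I have been using so far are shown in purple. Next I will switch to the 2 shown in brown.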

'26dc41664': 1.6e9 245 pub_test
'4ef6695ce': 2.0e9 439
'8242609fa': 1.4e9 586
'afa5e8098': 1.6e9 235 pub_test
'b2dc8411c': 4.6e8 138 pub_test
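'b9a3865fc': 1.3e9 469 pub_test
'c68fe75ea': 1.3e9 118 pub_test

Keeping the 8 train_data images fixed, the choice of additional images completely changed both the convergence behavior and the LB score.

Adding the 2 in purple: LB=0.883

Adding the 3 in black: LB=0.850

Adding the 2 in khaki: LB=0.846

I am astonished. With the first 8 the same, how can the LB score differ this much depending on which of the remaining 7 are added?

The LB score on the private data will presumably differ again. Is this just how things are, or is my approach wrong? Perhaps the lesson is: this is how it is, and that is exactly why you have to use KFold, model ensembling, and TTA broadly and average the results.

When the 3 images are added, the model is clearly overfitting, as shown below. If the low LB score is due to overfitting, then stronger augmentation should improve it.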

1│ 0.1138│ 0.0595│ 0.5990│ 0.7940│ 4.24
2│ 0.0379│ 0.0420│ 0.8618│ 0.8733│ 4.20
3│ 0.0327│ 0.0420│ 0.8800│ 0.8787│ 4.20
4│ 0.0282│ 0.0350│ 0.8954│ 0.8733│ 4.16
5│ 0.0258│ 0.0381│ 0.9029│ 0.8676│ 4.14
6│ 0.0242│ 0.0329│ 0.9089│ 0.8896│ 4.15
7│ 0.0221│ 0.0338│ 0.9152│ 0.8974│ 4.14
8│ 0.0209│ 0.0368│ 0.9199│ 0.8764│ 4.19
9│ 0.0200│ 0.0304│ 0.9229│ 0.8987│ 4.16
10│ 0.0192│ 0.0305│ 0.9254│ 0.9039│ 4.19 : LB=0.850

The training run with the 2 khaki images added, shown next, looks good, but the LB score is low. One possible cause is a difference between the data used for training and the public data.

1│ 0.1592│ 0.0940│ 0.4803│ 0.6938│ 3.42
2│ 0.0533│ 0.0547│ 0.8175│ 0.8650│ 3.39
3│ 0.0442│ 0.0510│ 0.8506│ 0.8180│ 3.41
4│ 0.0441│ 0.0430│ 0.8490│ 0.8539│ 3.38
5│ 0.0359│ 0.0555│ 0.8762│ 0.7945│ 3.39
6│ 0.0320│ 0.0404│ 0.8885│ 0.8592│ 3.37
7│ 0.0299│ 0.0361│ 0.8945│ 0.8478│ 3.34
8│ 0.0282│ 0.0347│ 0.9003│ 0.8770│ 3.38
9│ 0.0262│ 0.0311│ 0.9068│ 0.8877│ 3.37
10│ 0.0258│ 0.0273│ 0.9058│ 0.9042│ 3.35 : LB=0.846

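The fact that the LB score differed across the three choices of additional data suggests that, with my current method, the generality of the prediction model is low.

It seems that adding data similar to the first 8 images makes overfitting more likely, while adding data different from them makes it less likely.

I now have three kinds of prediction model: one that overfits and has a poor LB score, one that does not overfit and has a good LB score, and one that does not overfit but has a poor LB score. Different choices of train_data yield different models.

How can I build a model that captures glomeruli accurately? I will think about it while looking through the train_data.

Note: today I registered on AtCoder, to relearn programming from scratch. I will start with the introductory C++ course (APG4b: AtCoder Programming Guide for beginners).

April 26 (Mon)

HuBMAP: 1,533 teams, 15 days to go

Change CyclicLR base_lr from 0.1 to 0.01 and 0.001, using the code that gave LB=0.883.

base_lr=0.1 : LB=0.883

base_lr=0.01 : LB=0.875

base_lr=0.001 : LB=0.870

Now I want to try a base_lr larger than 0.1. Let's try 0.2.

base_lr=0.2 : LB=0.865

Interesting, but useless!

With the future in mind, I will try model ensembling.

Tomorrow I will attempt an ensemble of B0-Unet and B2-Unet.

AtCoder: working through the beginner C++ course. Of the exercises EX1 to EX26, I have reached EX10.

April 27 (Tue)

HuBMAP: 1,540 teams, 14 days to go

How do I build two prediction models, Effb0-Unet and Effb2-Unet, and ensemble them?

Let's learn from public code.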
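One common way to ensemble segmentation models, sketched here under simple assumptions, is to average the per-pixel probabilities from each model and threshold the average. The function and the example values below are hypothetical; in practice the inputs would be the sigmoid outputs of Effb0-Unet and Effb2-Unet on the same tile, held as arrays rather than lists.

```python
def ensemble_masks(prob_maps, weights=None, threshold=0.5):
    """Average per-pixel probabilities from several models, then threshold."""
    n_models = len(prob_maps)
    if weights is None:
        weights = [1.0 / n_models] * n_models      # plain mean by default
    n_pixels = len(prob_maps[0])
    averaged = [sum(w * pm[i] for w, pm in zip(weights, prob_maps))
                for i in range(n_pixels)]
    return [1 if p > threshold else 0 for p in averaged]


# Made-up probabilities for four pixels from two models.
b0_probs = [0.9, 0.4, 0.2, 0.7]
b2_probs = [0.8, 0.7, 0.1, 0.2]
mask = ensemble_masks([b0_probs, b2_probs])
print(mask)
```

Averaging probabilities before thresholding lets a confident model outvote an uncertain one, which voting on already-binarized masks cannot do with only two models.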

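April 28 (Wed)

HuBMAP: 1,553 teams, 13 days to go

Tuning public code

April 29 (Thu)

HuBMAP: 1,566 teams, 12 days to go

Tuning public code again today.

Both the train code and the inference code are public; it is easy to use, well made, and scores high on the LB, and I am hooked on tuning it.

April 30 (Fri)

HuBMAP: 1,571 teams, 11 days to go

Tuning again today.

Still at the LB score I got on April 27; no progress.

It uses Lookahead, and optimizing its parameters looks like it will take time.

May 1 (Sat)

HuBMAP: 1,570 teams, 10 days to go

Let's look at the Lookahead paper.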

f:id:AI_ML_DL:20210501101146p:plain

arXiv:1907.08610v2 [cs.LG] 3 Dec 2019

Abstract
The vast majority of successful deep neural networks are trained using variants of
stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of “fast weights" generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR- 10/100, neural machine translation, and Penn Treebank.

成功したディープニューラルネットワークの大部分は、確率的勾配降下法（SGD）アルゴリズムの変形を使用してトレーニングされています。 SGDを改善する最近の試みは、大きく2つのアプローチに分類できます。（1）AdaGradやAdamなどの適応学習率スキーム、および（2）ヘビーボールやネステロフ運動量などの加速スキームです。この論文では、これらの以前のアプローチに直交し、2セットの重みを繰り返し更新する新しい最適化アルゴリズムであるLookaheadを提案します。直感的に、アルゴリズムは別のオプティマイザーによって生成された「高速重み」のシーケンスを先読みして検索方向を選択します。ルックアヘッドが学習の安定性を向上させ、計算とメモリコストを無視して内部オプティマイザーの分散を低減することを示します。先読みは、ImageNet、CIFAR- 10/100、ニューラル機械翻訳、およびPenn Treebankのデフォルトのハイパーパラメーター設定を使用しても、SGDとAdamのパフォーマンスを大幅に向上させることができます。＜Google翻訳＞

f:id:AI_ML_DL:20210501111813p:plain

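Looking at figures and tables like this, I am more struck by the performance gap between SGD and ADAM than by Lookahead itself. Moreover, for SGD the effect of Lookahead is small.

May 2 (Sun)

HuBMAP: 1,570 teams, 9 days to go

I changed Lookahead's sync_period within the range shown in the paper (5, 10, 20), from 6 to 12, but saw almost no visible difference. Where oscillations appear within the first 20 to 30 epochs, their period may have changed, but that may just be my preconception. The LB score went down slightly.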
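What sync_period controls can be seen in a minimal sketch of the Lookahead update rule from the abstract above: an inner optimizer updates "fast" weights every step, and every sync_period steps the "slow" weights move a fraction of the way toward the fast weights, which are then reset to the slow ones. The code below is a toy scalar version with plain SGD as the inner optimizer and a quadratic objective; the function name, the slow_step_size of 0.5, and the toy problem are assumptions for illustration, not the competition code.

```python
def lookahead_sgd(grad_fn, w0, lr=0.1, sync_period=6, slow_step_size=0.5, steps=60):
    """Toy scalar Lookahead wrapped around plain SGD."""
    slow = fast = w0
    for step in range(1, steps + 1):
        fast = fast - lr * grad_fn(fast)                   # inner optimizer step on fast weights
        if step % sync_period == 0:                        # every sync_period steps:
            slow = slow + slow_step_size * (fast - slow)   # move slow weights toward fast weights
            fast = slow                                    # restart fast weights from slow weights
    return fast


grad = lambda w: 2 * (w - 3.0)        # gradient of f(w) = (w - 3)^2, minimum at w = 3

for k in (6, 12):
    w_final = lookahead_sgd(grad, w0=0.0, sync_period=k)
    print(f"sync_period={k}: w after 60 steps = {w_final:.4f}")
```

On a smooth problem like this, both settings end up close to the minimum, which is at least consistent with the small differences observed when changing sync_period from 6 to 12.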

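May 3 (Mon)

HuBMAP:

I tried to change the learning_scheduler parameters, but compared with PyTorch there seemed to be fewer parameters I could set. That may just be because I have not kept up with recent Keras.

May 4 (Tue)

HuBMAP:

I have used up the 30 hours of TPU, so I switched to GPU to explore training conditions, but the computation is overwhelmingly slower (1/20 to 1/30) and the batch size is drastically different, so it is hard to follow up on the TPU results.

I want to study the effect of Lookahead's sync_period, but that may be impossible on GPU.

While trying the GPU it occurred to me that, because the TPU is fast enough to run 100 or even 200 epochs, I may have been using too many epochs.

On GPU, even 12 epochs with 5 folds takes about 6 hours. On TPU, 100 epochs with 5 folds takes about two and a half hours.

I varied the initial learning rate, batch size, and so on, but val_dice_coefficient never exceeded 0.80, so I am giving up on the GPU.

What remains is how to use the last 30 hours of TPU. To find better training conditions, I will review the nearly 10 training runs so far.

May 5 (Wed)

HuBMAP: 1,554 teams, 6 days to go

No progress.

May 6 (Thu)

HuBMAP: 1,564 teams, 5 days to go

No progress.

All I can do is wait for the TPU to become available on Saturday morning.

May 7 (Fri)

HuBMAP: 1,570 teams, 4 days to go

One of the 5 public_test_data images seems to contain many glomeruli whose internal structure looks collapsed, which makes them hard to outline correctly, and as a result the LB score does not improve. Hand-labeling this test_data and adding it to train_data raises the LB score. Hand-labeled data have been published, and it has been officially confirmed that this is not against the rules in this competition.

Teams competing on how well they can predict data that differ from train_data are at around 0.93+, and I believe the person who published the hand labels reported that using hand labels for one of the 5 images gives about 0.934. I think other hard-to-predict examples were also introduced in Discussion (as I recall, bright and low-contrast because the specimen was thin).

Using hand labels or pseudo labels for the 5 public_test_data images increases train_data, so one can expect not only better accuracy on public_test_data but also good accuracy on private_test_data. I suspect that is what the 0.935+ teams are doing.

Teams that cannot add the 5 public_test_data images to train_data, myself included (I am tuning borrowed public code), will probably find both a high public LB score and a high private LB score difficult.

Still, since someone kindly published such excellent code, I would like to stay within the top 10% somehow. But how?

[Aside] The code I am using is written in Keras, and I realized I do not understand it well enough to use callbacks, so I decided to reread F. Chollet's Deep Learning textbook. For image processing I had studied up to Chapter 5, Deep Learning for computer vision, but I had hardly read Chapter 7, Advanced deep-learning best practices. Section 7.2 has an interesting passage. Running dozens of epochs of training on a large dataset with model.fit( ) or model.fit_generator( ) is like launching a paper airplane: once it leaves your hand, you control neither its path nor where it lands. If you fly a drone instead, it can sense its environment and report to the operator, who can then steer it according to the situation. That is what callbacks do. The text goes on to explain keras.callbacks such as ModelCheckpoint, EarlyStopping, LearningRateScheduler, ReduceLROnPlateau, and CSVLogger.

TPU: my impression is that the speedup comes from using large batches. In the Flower Classification with TPUs competition, the tutorial model used a batch size of 128 with Adam; the learning rate started at 0.00001, rose to 0.0004 over about 5 epochs, and fell to about 0.00011 by epoch 12. It looks a bit like CyclicLR.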
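A schedule with the shape described above can be sketched as a linear warm-up followed by an exponential decay. The start, peak, and warm-up length come from the description; the decay factor of 0.8 per epoch and the floor of 0.00001 are assumptions chosen so that the 12th epoch lands near 0.00011, and the function name is made up for this illustration. A function of this form could be handed to a Keras LearningRateScheduler callback.

```python
def warmup_decay_lr(epoch, lr_start=1e-5, lr_max=4e-4, lr_min=1e-5,
                    rampup_epochs=5, decay=0.8):
    """Linear warm-up to lr_max, then exponential decay toward lr_min."""
    if epoch < rampup_epochs:
        return lr_start + (lr_max - lr_start) * epoch / rampup_epochs
    return lr_min + (lr_max - lr_min) * decay ** (epoch - rampup_epochs)


# Epochs are 0-indexed here, so index 11 is the 12th epoch.
for epoch in range(12):
    print(epoch, f"{warmup_decay_lr(epoch):.6f}")
```

Unlike the triangular CyclicLR used earlier, the decay here is exponential rather than linear, so the learning rate drops quickly right after the peak and then flattens out.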

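I do not expect this to carry over directly, but I had no other next move, so tomorrow I will see what happens with this batch size and learning rate.

May 8 (Sat)

HuBMAP: 1,162 teams, 3 days to go. That is about 400 fewer teams than yesterday. What happened? It appears that teams left untouched since before the dataset update have been removed. My own submission count has dropped correspondingly. With the denominator smaller, the number of teams in the medal zone fell by about 40. I was pushed out of the medal zone at a stroke and lost my motivation. A pity.

That said, it would be a waste not to keep trying to the end.

I experimented a little with smaller batch sizes. It had been 1024 so far; I tried 512, 256, 128, and 64. The smaller the batch size, the faster the convergence, but the sooner train_loss and val_loss cross over; 1024 was the best of these.

So I tried to go above 1024 (2048, 1536, 1280, and so on), but the run stopped with an error. 1024 seems to be the largest value that can be set.

May 9 (Sun)

HuBMAP: 1,172 teams, 2 days to go

Tomorrow is the last day (the deadline is 9 a.m. the day after tomorrow), but I have run out of ways to raise the score.

I am running calculations anyway (mainly training on TPU), but changing parameters haphazardly has improved nothing (neither val_dice nor the LB score).

I have neither the compute nor the time to explore conditions now, so this competition ends here for me.

*I borrowed the preprocessing and submit code and explored many variations with Semantic Segmentation code for the training part, but a single model only reached LB=0.883 (val_dice_coefficient about 0.91).

*For the last two weeks I borrowed public code with LB=0.927. I started at LB=0.920 and, after exploring training conditions, reached 0.932. That said, I did nothing especially clever, and in the end I could not surpass the original conditions of the public code.

*In the initial literature survey I got the impression that complex models would not be needed for this competition, but along the way I tried various model combinations and even borrowed models from fairly recent papers on GitHub. What affected the score most were augmentation (albumentations) and KFold. The effect of TTA was unclear. I started on model ensembling several times but in the end never got to test it, even with borrowed code. I did not get to try pseudo labels either.

*In effect, I spent more than three months on this competition.

May 10 (Mon)

HuBMAP: 1,172 teams, a day to go

Two more runs: train carefully with different seed_number values, run inference, submit the results, and finish.

I trained twice and ran inference with each; both gave LB=0.930. Because I changed the seed, val_dice_coefficient took various values, and the OOF loss and OOF dice_coefficient differed considerably between the two runs, so I was surprised that the LB scores were identical. I feel there is an important message hidden here, but I cannot see what it is.

Earlier results also include 0.932 and 0.931, so I have no idea which code to choose for the final submission.

That concludes this competition.

From tomorrow I plan to join Bristol-Myers Squibb – Molecular Translation.

May 11 (Tue)

HuBMAP:

The provisional results are out. I dropped 130 places from my public_LB rank.

There was a major shake-up this time.

If medals had been my only goal in this competition I would be truly disappointed, but I got to try many things and it was fun. Let's leave it at that.

The 2 submissions I finally selected turned out not to be the right choice, but even the one I hesitated over at the end would have placed around 200th to 211th.

The closest to the medal zone was code that would have placed around 126th to 149th, but by my criteria there was no reason to choose it (its Public_LB score was about 0.015 lower than the code I finally selected).

In closing: this HuBMAP competition ended with a major shake-up and turned out badly, but I learned a lot and it was fun. Let's leave it at that.

[Aside] Now that the HuBMAP competition is over, looking at the results and the participants' comments, I cannot help wondering what everyone's effort was for. What does it mean that scores differ so much between public_test_data and private_test_data? Is it right to run a competition on test_data like this? It would be one thing to reward models that score high on both sets, but using test_data that produce skewed scores and judging only by one set (private_test_data) seems terribly unfair and misguided. My guess is that the skew arose because flaws were found in the original dataset and it had to be rebuilt in a hurry. Did the organizers anticipate this? It had almost no effect on me in the end, but reading the Kagglers' post-competition comments, I think it must have been a truly disappointing competition, especially for the top-level teams.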

f:id:AI_ML_DL:20210301083105p:plain — style=164 iterations=500

f:id:AI_ML_DL:20210301083248p:plain — style=164 iterations=50

f:id:AI_ML_DL:20210301083347p:plain — style=164 iterations=5
