Kaggle散歩（2021年6月1日～8月10日：SIIM-FISABIO-RSNA COVID-19 Detection）

2021年6月1日～8月10日：

SIIM-FISABIO-RSNA COVID-19 Detection

Identify and localize COVID-19 abnormalities on chest radiographs

今日から2か月と10日間、このコンペに取り組む。

このコンペは、コンペ内の全てのcodeとdiscussionを見て学ぶことに集中する。

In this competition, you’ll identify and localize COVID-19 abnormalities on chest radiographs.

In particular, you'll categorize the radiographs as negative for pneumonia or typical, indeterminate, or atypical for COVID-19.

You and your model will work with imaging data and annotations from a group of radiologists.

6月1日（火）

Society for Imaging Informatics in Medicine (SIIM)：281 teams, 2 months to go

概要：胸部レントゲン写真からCOVID-19感染者の症状を推測する。

discussion：

過去の類似コンペのトップレベルの解法の紹介がある。

データサイズは約120 GB。画像サイズが大きくてそのままでは扱えないので、1000 x 1000以下のサイズに縮小し、データセットにして公開している人がいる。非常にありがたい。データセット作成に用いたコードが公開されているので、それを参考にして、自分で準備することも良い練習となる。

ケロッピアバターのGMさんが、starter kitを作成しますと宣言。BMSコンペでも行っていて、トップ100位以内のレベルのコードを公開しており、当人のチームは20位以内のハイレベル。

ベストスコアのコードを検索。trainも公開されているコードを探す。この時点では、学ぶだけ。

ベストスコアで検索し、スコアは低いが正確なコードを選び、コピー、実行、commit, submitにより、リーダーボードの下位に顔を出す。LB=0.005

6月2日（水）

Society for Imaging Informatics in Medicine (SIIM)：291 teams, 2 months to go

Evaluation：

standard PASCAL VOC 2010 mean Average Precision (mAP) at IoU > 0.5を用いる。

In this competition, we are making predictions at both a study (multi-image) and image level.：スタディ（マルチ画像）と画像レベルの両方で予測、となってる。

Study-level labelsとImage-level labelsがある。

train_study_level.csvとtrain_image_level.csvの２つのファイルがある。

Study-level labels：

Studies in the test set may contain more than one label. They are as follows:

"negative", "typical", "indeterminate", "atypical"

For each study in the test set, you should predict at least one of the above labels.

The format for a given label's prediction would be a class ID from the above list, a confidence score, and 0 0 1 1 is a one-pixel bounding box.

Study-level labelsのSubmission Fileへの出力例

Id,　　　　　　　　　 PredictionString

2b95d54e4be65_study, negative 1 0 0 1 1
2b95d54e4be66_study, typical 1 0 0 1 1
2b95d54e4be67_study, indeterminate 1 0 0 1 1 atypical 1 0 0 1 1

Image-level labels：

Images in the test set may contain more than one object.

For each object in a given test image, you must predict a class ID of "opacity", a confidence score, and bounding box in format xmin ymin xmax ymax.

If you predict that there are NO objects in a given image, you should predict none 1.0 0 0 1 1, where none is the class ID for "No finding", 1.0 is the confidence, and 0 0 1 1 is a one-pixel bounding box.

Image-level labelsのSubmission Fileへの出力例

Id,　　　　　　　　　 PredictionString

2b95d54e4be68_image, none 1 0 0 1 1
2b95d54e4be69_image, opacity 0.5 100 100 200 200 opacity 0.7 10 10 20 20

以上のように、スタディレベル（"negative", "typical", "indeterminate", "atypical"）と、イメージレベル（"non", "opacity"）の2つに分けられている。

公開コード（train）（EfficientNetB7使用、TPUで動作）のチューニングをやってみた。

Dropoutの導入、全結合層の導入、EfficientNetのサイズの変更、バッチサイズ、学習率の初期値などの変更を検討した。

今日は、オリジナルを超える結果は得られなかった。

TPUの代わりにGPUを使ってみた。EfficientNetB7は使えず、B4に変更した。画像も600 pixelではオーバーフローする。とりあえず、動かそうと思い、224 pixelにすれば、余裕で動くのだが、これが、非常に遅い。

あまりに遅いので、B0に変更して、5Foldまで計算してみた。B7に600 pixelの画像を流すのとは比較にならないだろうと思ったが、予測モデルをデータセットにしてinferenceコードに追加して予測した結果をsubmitしてみたところ、スコアの数値としては比較的近い値となった。

オリジナル：B7 ＆ 600 pixel & TPU --> LB=0.383

変　更　　：B0 & 224 pixel & GPU --> LB=0.373

良いモデルは、どう扱っても、そこそこのスコアになるものだが、B7 --> B0と、600 pixel --> 224 pixelとグレードダウンしても、LBスコアが大きくは下がらなかったことに驚いている。

コンペの問題点：

このコンペは、スタディレベル（"negative", "typical", "indeterminate", "atypical"）と、イメージレベル（"non", "opacity"）の２つの組み合わせ（病状の分類と病状箇所の検出）になっているのだが、イメージレベルの予測結果に対して、評価ミスが生じているようだ。イメージレベルの予測結果を追加・削除してcommitすることによって、イメージレベルの予測データの寄与は、LB=0.001あるいはLB=0.051となったとのこと）両方の寄与が五分五分だと仮定すると、LBスコアは、現在の0.4+レベルではなく倍の0.8+レベルになっていなければならない、ということのようである。

ということで、現状のリーダーボードは、ほぼ、スタディーレベルだけの評価になっている。Kaggle StaffのJulia Elliottさんによれば、来週には改善される予定。

このため、いまだ、参加チームが少ない、ということなのかもしれない。

6月3日（木）

Society for Imaging Informatics in Medicine (SIIM)：312 teams, 2 months to go

公開コード使用：B0 & 224 pixel & GPU--> LB=0.373

このコードはスタディーレベル（分類）専用なので、これ以上やっても意味ないかなとは思うのだが、イメージレベル（検出）は使うモデルが違うので（EfficientDetは両方できる？）今は、スタディーレベルの課題に集中しよう。

224 pixel & GPUで、Lrを変えたり、factor（学習率の減衰率）を変えたり、さらにはB0からB1やB2もやってみたが、途中経過がよくなかったので、最後まで計算せずに終えたこともあって、今日は、LB=0.373を超えることはなかった。

HuBMAPでは、Super-Convergenceを期待して、CyclicLRとOneCycleLRを使っていたので、ここでも使ってみようと思ったのだが、今使っているコードはTensorFlow/Kerasなので、このようなモジュールは無い。

Kerasで検索しても似たようなものは無かったが、TensorFlowに、Useful extra functionality for TensorFlow maintained by SIG-addons.というのがあり、そこに同等のモジュールがあった。

Module: tfa.optimizers

class AdamW: Optimizer that implements the Adam algorithm with weight decay.

class AveragedOptimizerWrapper: Base class for Keras optimizers.

class COCOB: Optimizer that implements COCOB Backprop Algorithm

class ConditionalGradient: Optimizer that implements the Conditional Gradient optimization.

class CyclicalLearningRate: A LearningRateSchedule that uses cyclical schedule.

class DecoupledWeightDecayExtension: This class allows to extend optimizers with decoupled weight decay.

class ExponentialCyclicalLearningRate: A LearningRateSchedule that uses cyclical schedule.

class LAMB: Optimizer that implements the Layer-wise Adaptive Moments (LAMB).

class LazyAdam: Variant of the Adam optimizer that handles sparse updates more

class Lookahead: This class allows to extend optimizers with the lookahead mechanism.

class MovingAverage: Optimizer that computes a moving average of the variables.

class MultiOptimizer: Multi Optimizer Wrapper for Discriminative Layer Training.

class NovoGrad: Optimizer that implements NovoGrad.

class ProximalAdagrad: Optimizer that implements the Proximal Adagrad algorithm.

class RectifiedAdam: Variant of the Adam optimizer whose adaptive learning rate is rectified

class SGDW: Optimizer that implements the Momentum algorithm with weight_decay.

class SWA: This class extends optimizers with Stochastic Weight Averaging (SWA).

class Triangular2CyclicalLearningRate: A LearningRateSchedule that uses cyclical schedule.

class TriangularCyclicalLearningRate: A LearningRateSchedule that uses cyclical schedule.

class Yogi: Optimizer that implements the Yogi algorithm in Keras.

明日は、CyclicalLearningRateを使ってみようと思う。

tfa.optimizers.CyclicalLearningRate(
initial_learning_rate: Union[FloatTensorLike, Callable],
maximal_learning_rate: Union[FloatTensorLike, Callable],
step_size: tfa.types.FloatTensorLike,
scale_fn: Callable,
scale_mode: str = 'cycle',
name: str = 'CyclicalLearningRate'
)

SGD（0.1-2.0）を想定して具体的な数値を入力した例：

clr = tfa.optimizers.CyclicalLearningRate(0.1, 2.0,
step_size=10,
scale_fn=lambda x: 1,
scale_mode= 'cycle',
name= 'CyclicalLearningRate'
)

＜雑談＞AI, Machin Learning, Deep Learning, Data Science, Engineer or Scientist or Programmer, Kaggler：これらの単語からイメージされる領域で、自分が目指す方向を表現するのに適しているのは何か。AIは、漠然としていてつかみどころがない。Data Scienceは、自分の中では前処理のイメージが強い。Wikipediaで調べてみよう。

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.

Data Scienceは、Machin Learning、Deep Learningだけでなく、あらゆる科学技術分野に対して、横断的に関連しているものと捉えることができるもののようである。

自分にはなじみの薄い用語、data miningについてもWikipediaで調べてみよう。

Data mining is a process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.[1] Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use.

勝手な解釈かもしれないが、data mining + big data = data scienceということにして、自分の現在および近未来の専門領域は、仮に、Data Scientistとしておこう。

最近の話題を調べるために、data scienceをキーワードに、Google Scholarで検索した。

雑談としては長すぎるような気もするが、面白そうなのでここに紹介しておく。

AutoDS: Towards Human-Centered Automation of Data Science

Dakuo Wang et al., arXiv:2101.05273v1 [cs.HC] 13 Jan 2021

Abstract

Data science (DS) projects often follow a lifecycle that consists of laborious tasks for data scientists and domain experts (e.g., data exploration, model training, etc.). Only till recently, machine learning(ML) researchers have developed promising automation techniques to aid data workers in these tasks. This paper introduces AutoDS, an automated machine learning (AutoML) system that aims to leverage the latest ML automation techniques to support data science projects. Data workers only need to upload their dataset, then the system can automatically suggest ML configurations, preprocess data, select algorithm, and train the model. These suggestions are presented to the user via a web-based graphical user interface and a notebook-based programming user interface. We studied AutoDS with 30 professional data scientists, where one group used AutoDS, and the other did not, to complete a data science project. As expected, AutoDS improves productivity; Yet surprisingly, we find that the models produced by the AutoDS group have higher quality and less errors, but lower human confidence scores. We reflect on the findings by presenting design implications for incorporating automation techniques into human work in the data science lifecycle.

データサイエンス (DS) プロジェクトは、多くの場合、データサイエンティストと分野の専門家 (データ探索、モデルトレーニングなど) の骨の折れるタスクで構成されるライフサイクルに従います。最近まで、機械学習 (ML) の研究者は、これらのタスクでデータワーカーを支援する有望な自動化手法を開発してきました。このホワイトペーパーでは、最新の ML 自動化手法を活用してデータサイエンスプロジェクトをサポートすることを目的とした自動機械学習 (AutoML) システムである AutoDS を紹介します。データワーカーはデータセットをアップロードするだけで、システムは ML 構成を自動的に提案し、データを前処理し、アルゴリズムを選択し、モデルをトレーニングできます。これらの提案は、Web ベースのグラフィカルユーザーインターフェイスとノートブックベースのプログラミングユーザーインターフェイスを介してユーザーに表示されます。私たちは 30 人のプロのデータサイエンティストと共に AutoDS を研究しました。一方のグループは AutoDS を使用し、もう一方のグループは使用しませんでした。データサイエンスプロジェクトを完了しました。予想どおり、AutoDS は生産性を向上させます。しかし驚くべきことに、AutoDS グループによって作成されたモデルの方が品質が高く、エラーが少なくなっていますが、人間の信頼スコアは低いことがわかりました。データサイエンスライフサイクルにおける人間の作業に自動化手法を組み込むための設計の意味を示すことにより、調査結果を反映します。by Google翻訳

6月4日（金）

Society for Imaging Informatics in Medicine (SIIM)：335 teams, 2 months to go

オリジナル：B7 ＆ 600 pixel & TPU --> LB=0.383

B0 & 224 pixel & GPU--> LB=0.373

B1 & 240 pixel & GPU--> LB=0.377

B2 & 260 pixel & GPU--> LB=0.382

6月5日（土）

Society for Imaging Informatics in Medicine (SIIM)：348 teams, 2 months to go

B2 & 260 pixel & GPU--> LB=0.382

TPUが使えるようになったので、GPUと同一条件で計算したが、途中経過がよくなかった。0.77+となるべきところが、0.75+となった。

TPUでは、bfloat16を宣言して使う必要があるらしい。

image = tf.cast(image, tf.bfloat16)

これによって、GPUのfloat32と同等の精度になるらしい。

さらに、KaggleのTPUは8ユニットの並列動作なので、バッチ数をチェックしないと、GPUの場合の8倍になって、実効的な学習率が下がるなどのために、GPUの場合より精度が下がることがある。

少し条件を検討してみたが、次の結果となり、GPUでのLB=0.383には届かなかった。

B2 & 260 pixel & GPU：LB=0.382

B2 & 260 pixel & TPU：LB=0.375

さらに検討する必要がある。なお、この場合、TPUの計算速度はGPUの約4倍であった。

GPUでは、TPUよりも小さなモデルと画素数の組み合わせで、同等の予測精度が得られた。しかしながら、GPUでは、さらに大きなモデルと画素数の組み合わせにすると、メモリーオーバーになる。これについては、バッチ数を下げることで対応できるかどうか調べる予定である。

TPUでは並列計算により、GPUより高速に計算できるのでより大きなモデルと画素数の組み合わせで計算できるようであるが、同レベルのモデルと画素数の組み合わせではGPUより精度が落ちるようである。原因の1つは、bfloat16を使っていなかったことにあることがわかった。条件を同じにすれば、精度は同じになる筈なので、見落としが無いか調べる必要がある。

data_augmentationの検討：

def build_augmenter(with_labels=True):
def augment(img):
img = tf.image.random_flip_left_right(img)
img = tf.image.random_flip_up_down(img)
return img
def augment_with_labels(img, label):
return augment(img), label
return augment_with_labels if with_labels else augment

TensorFlowのData augmentationの説明ページに掲載されているrandom系の例

tf.image.stateless_random_brightness
tf.image.stateless_random_contrast
tf.image.stateless_random_crop
tf.image.stateless_random_flip_left_right
tf.image.stateless_random_flip_up_down
tf.image.stateless_random_hue
tf.image.stateless_random_jpeg_quality
tf.image.stateless_random_saturation

brightnessを追加してみたが、エラーが出て、うまくいかなかった。

tf.image.stateless_random_brightness(image, max_delta=0.5, seed=new_seed)

6月6日（日）

Society for Imaging Informatics in Medicine (SIIM)：366 teams, 2 months to go

公開コードを用い、次の組み合わせで、スタディーレベル（分類）の訓練を予測をやってみる。それぞれに最適条件はあるだろうと思うが、モデルと解像度以外は同一条件としてみた。batch_size=32, Adam(1e-3),

B0 & 224 pixel & TPU：LB=0.367 : 1.000

B1 & 240 pixel & TPU：LB=0.374 : 1.019

B2 & 260 pixel & TPU：LB=0.375 : 1.022

B3 & 300 pixel & TPU：LB=0.378 : 1.030

B4 & 380 pixel & TPU：LB=0.381 : 1.038

B5 & 456 pixel & TPU：LB=0.375 : 1.022

B6 & 528 pixel & TPU：LB=0.379 : 1.033

B7 & 600 pixel & TPU：LB=0.369 : 1.005

B0 ＆224からB4 & 380までは、予想通り、期待通りだと思う。

ところが、その先B5 & 456からB7 & 600 の結果には、大半の人は首を傾げると思う。

EfficientNetの論文のFigure 1には、B0からB7までImagenet Top-1 Accuracyが高くなる様子が図示されている。この図を見ると、誰でも番号の大きいモデルを使いたいと思うはず。この論文が示していることと、上記の実験結果との不一致の原因を調べよう。

EfficientNetの論文のFigure 1は、次のように計算されたと書かれている。

5.2. ImageNet Results for EfficientNet
We train our EfficientNet models on ImageNet using similar settings as (Tan et al., 2019): RMSProp optimizer with decay 0.9 and momentum 0.9; batch norm momentum 0.99;
weight decay 1e-5; initial learning rate 0.256 that decays by 0.97 every 2.4 epochs. We also use swish activation (Ramachandran et al., 2018; Elfwing et al., 2018), fixed AutoAugment policy (Cubuk et al., 2019), and stochastic depth (Huang et al., 2016) with drop connect ratio 0.3. As commonly known that bigger models need more regularization, we linearly increase dropout (Srivastava et al., 2014) ratio from 0.2 for EfficientNet-B0 to 0.5 for EfficientNet-B7.

最後の3行が特に重要で、モデルが大きいほど、regularizationの重要性が高くなり、dropout ratioはB0の0.2をB7では0.5にしたと書かれている。

Figure 1だけ見て大きなモデルを使おうとしてもだめだということがわかる。

具体的にどうすれば良いのか。

１．RMSProp optimizer with decay 0.9 and momentum 0.9; batch norm momentum 0.99; weight decay 1e-5; initial learning rate 0.256 that decays by 0.97 every 2.4 epochs.

２．swish activation

tf.keras.activations.swish

３．fixed AutoAugment policy

https://github.com/google/automl/blob/master/efficientnetv2/autoaugment.py

AutoAugment: Learning Augmentation Strategies from Data

RandAugment: Practical automated data augmentation with a reduced search space

４．stochastic depth (Huang et al., 2016) with drop connect ratio 0.3

Stochastic depth aims to shrink the depth of a network during training, while keeping it unchanged during testing. We can achieve this goal by randomly dropping entire ResBlocks during training and bypassing their transformations through skip connections.

５．linearly increase dropout (Srivastava et al., 2014) ratio from 0.2 for EfficientNet-B0 to 0.5 for EfficientNet-B7

大きなモデルに見合った結果を得るには、相当の準備が必要だということが分かった。

GitHub：google/automl/efficientnetv2

EfficientNetV2

May13/2021: Initial code release for EfficientNetV2 models: accepted to ICML'21.

1. About EfficientNetV2 Models

EfficientNetV2 are a family of image classification models, which achieve better parameter efficiency and faster training speed than prior arts.

Built upon EfficientNetV1, our EfficientNetV2 models use neural architecture search (NAS) to jointly optimize model size and training speed, and are scaled up in a way for faster training and inference speed.

f:id:AI_ML_DL:20210606141805p:plain

6月7日（月）

Society for Imaging Informatics in Medicine (SIIM)：386 teams, 2 months to go

B5 & 456 pixel & TPU：LB=0.375（Batch_size=32）

これは、Batch_size=32でtrainingしたもの。Batch_sizeによる違いを調べてみよう。TPUの残り時間が少ないので、まずは、Batch_size=128にしてみよう。

結果は、LB=0.375、となった。32でも128でもLBスコアは変わらなかった。

B7 & 600 pixel & TPUは、batch_size=32ではLB=0.369となり、128ではLB=0.383であった。よくわからない。何か間違ったのだろうか。

B4 & 380 pixel & TPU $ batch_size=32：LB=0.381

このモデルのAdamをAdamW（weight_decay=1e-4)にしてみた。

その結果、LB=0.384となった。

LookaheadをAdamWとの組み合わせで試した。このとき、パラメータは、デフォルトもしくはexample of usageなどに例示されている値を使った。計算の途中経過が明らかに良くなかったので、Lookahead(AdamW)によるtrainingは途中で中止した。

CyclicLRとdata_augmentationのことをすっかり忘れている。

6月8日（火）

Society for Imaging Informatics in Medicine (SIIM)：402 teams, 2 months to go

TPUを使い切ったので、GPUによる検討に戻ろう。

B2 & 260 pixel & GPU：LB=0.382（batch_size=16）, Adam, 1e-3,

random_brightnessはこれで効果を確認する。

img = tf.image.rabdom_brightness(img, 0.2)

AdamはAdamW(lr=1e-3, weight_decay=1e-4)で試してみる。

training終了間際で、計算が停止していた。Go to Viewerも使えない。どうしたのだろうか。いずれにしても、約3時間が無駄になった。

イメージレベルが手つかずのままだ。検出は、本格的に取り組んだことがないので、お手本コードを探して勉強しよう。イメージレベルの予測結果のスコアリングの不具合もまだ修正されていないようだ。

6月9日（水）

Society for Imaging Informatics in Medicine (SIIM)
：415 teams
, 2 months to go

TPUの使用時間リセット待ち！

6月10日（木）

Society for Imaging Informatics in Medicine (SIIM)：439 teams, 2 months to go

＊＊＊＊＊中断＊＊＊＊＊

6月15日（火）

SIIM：496 teams, 2 months to go (8/9)

ようやく、問題点の修復が進み始めたようだ。

みなさん、こんにちは。今しばらくお待ちいただきますようお願いいたします。ホストチームはいくつかのテストラベルを更新しました。また、画像レベルのラベルのスコアリングに影響を与える問題も修正しました。フィードバックをお寄せいただきありがとうございます。リーダーボードを更新するプロセスを開始します。すべての提出物を再実行するため、これには時間がかかることに注意してください。 by Google翻訳

0.45+で頭打ちになっていたのが、徐々にスコアアップし、今のトップは、0.62となっている。自分が提出している提出データは、テストレベルの予測データのみで、画像レベルの予測データはデフォルトのままである。

ということで、今日から、画像レベルの予測にとりかかる。といっても、公開コードに頼りっぱなしだが。

6月16日（水）

進捗なし

6月17日（木）

進捗なし

6月18日（金）

SIIM：571 teams, 2 months to go

Image-levelの予測結果に対するスコアの再計算が進み、160件くらいが0.500を超えている。

スコアの再計算によって、Study-levelのスコアが0.055アップしていた。（0.384が0.439になっていた。）

最初はそう思ったのだが、そうではなくて、Image-levelの予測結果を、空欄ではなく、デフォルト値にしていたので（公開コードがそうなっていた）、Image-levelのスコアを正しく計算できるようになったことから、Image-levelのデフォルトデータのスコア：0.055が加算された、というのが、スコアアップの原因だとわかった。

＊＊＊散歩中＊＊＊

6月26日（土）

SIIM：706 teams, a month to go

このコンペに対しては、チューニングではなく、モデルの構築段階に踏み込んで、良い結果を出したいと考えている。考えてはいるのだが、難しい。ラベルは専門家が付けているが正確ではない。モデルは、ラベル以上に正確では、スコアは上がらず、かえって下がる。正解のラベルも正確ではないからである。かといって、このモデルが予測した結果こそが真実だ、というためには、そのことを証明しなければならない。専門家の判断を超える判断能力を備えるためには、専門家が見逃している、あるいは、専門家にも見えない情報をを増幅して見えるようにする、あるいはより正しく診断するために考慮しなければならない画像間の相関を見つけ出すことが必要になるのだろう。さらに、それらの証拠を見える化しなければならない。画像診断では、そういうことも起こりうるような気はするが、現状のレベルでは、100％、overfittingである。

7月1日（木）

あと40日

8月1日（日）

寄り道しているうちに、戻ってこれなくなった。

残念だが、ここで終了する。