Kaggle Stroll (May 11 to June 3)

May 11 to June 3

Bristol-Myers Squibb – Molecular Translation
Can you translate chemical images to text?

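Participation period: about 20 days. The task of the competition is to predict text in InChI (International Chemical Identifier) format from images of chemical structural formulas.

* I joined this competition on March 8 and then left it untouched.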

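May 11 (Tue)

Bristol-Myers Squibb: 647 teams, 23 days to go

There are relatively few teams. The task is probably too specialized. Then again, it is impressive that this many teams have joined such a specialized task.

So, how should I approach this competition?

Purpose: omitted

Goal: finish in the top 5%. ⇒ Because I approached it the wrong way, I am changing the goal to understanding the paper "End-to-End Attention-based Image Captioning" and, after the competition ends, studying the top solutions as deeply as I can. ⇒ I drifted away from the competition partway through to work on Reinforcement Learning and so on, so the whole thing ended up unfocused. If there is time at the end, I will run the train and inference parts of the public code and try to learn at least a little from the actual code.

Method: omitted

Score calculation: Levenshtein distance. "The Levenshtein distance is a kind of distance that indicates how different two strings are. It is also called the edit distance. Specifically, it is defined as the minimum number of steps required to transform one string into the other by inserting, deleting, or substituting single characters." (from Wikipedia)

If the correct answer is abcde and the prediction is bbccd, three characters (positions 1, 4, and 5) must be substituted to match the correct answer, so the score is 3. A correct prediction scores 0, and if all five characters have to be replaced the score is 5. If the lengths differ, the insertions and deletions are counted as well.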
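A minimal sketch of the edit-distance calculation described above, written as the standard dynamic-programming recurrence. The function name `levenshtein` is my own; as far as I understand, the leaderboard value is this distance averaged over all test images.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep on a match
        prev = curr
    return prev[-1]

print(levenshtein("abcde", "abcde"))  # identical strings
print(levenshtein("abcde", "bbccd"))  # the example pair from the text
```

Only two rows of the table are kept at a time, which is enough because each cell depends only on the current and previous rows.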

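On the leaderboard, submitting sample_submission.csv as is gives a score of 109.63. At present, 100th place is 3.70, 50th place is 2.09, the Gold zone is 1.13 or better, and the top score is 0.65.

Roughly speaking, these scores mean the top team is above 99 points out of 100, and even 50th place is around 98 points, so the level is very high.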
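One way to arrive at this rough "points out of 100" reading is sketched below. It assumes (my assumption, not the official metric) that the sample-submission score of 109.63 is treated as 0 points and a mean distance of 0 as 100 points.

```python
# Assumption: 109.63 (sample_submission.csv) maps to 0 points, distance 0 to 100 points.
BASELINE = 109.63

def rough_points(score: float, baseline: float = BASELINE) -> float:
    """Map a mean Levenshtein distance onto a 0-100 scale relative to the baseline."""
    return 100.0 * (1.0 - score / baseline)

for label, score in [("top", 0.65), ("gold", 1.13), ("50th", 2.09), ("100th", 3.70)]:
    print(f"{label:>5}: {rough_points(score):.1f}")
```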

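Is conversion this accurate really possible? That is amazing. I can't help feeling that there is no way I can break into this group. With a half-hearted approach, I doubt I can reach the top 5%.

An "upset" like the one in the earlier HuBMAP competition is hard to imagine here.

An example of a chemical structural formula and its corresponding InChI representation (from Wikipedia):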

f:id:AI_ML_DL:20210511153524p:plain

InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1
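To make the structure of such a string easier to see, here is a small sketch that splits a standard InChI into its slash-separated layers. The helper name `split_inchi` is hypothetical; the layer meanings in the comments (c = connectivity, h = hydrogens, t/m/s = stereochemistry) reflect my understanding of the standard InChI layout, and non-standard InChIs with repeated layers are not handled.

```python
def split_inchi(inchi: str) -> dict:
    """Split a standard InChI string into version, formula, and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    result = {"version": parts[0], "formula": parts[1] if len(parts) > 1 else ""}
    for layer in parts[2:]:
        if layer:
            # Remaining layers start with a one-letter prefix:
            # c = connectivity, h = hydrogens, t/m/s = stereochemistry.
            result[layer[0]] = layer[1:]
    return result

layers = split_inchi(
    "InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1"
)
for key, value in layers.items():
    print(key, value)
```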

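The structural formula borrowed from Wikipedia is very clean, but many of the images provided in the competition look as if they had been copied repeatedly from old books, and some make it hard to tell element symbols such as N, O, S, and P apart.

A paper addressing exactly the same task as this competition has been published.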

End-to-End Attention-based Image Captioning

Carola Sundaramoorthy, Lin Ziwen Kelvin, Mahak Sarin, and Shubham Gupta

arXiv:2104.14721v1 [cs.CV] 30 Apr 2021

Abstract
In this paper, we address the problem of image captioning specifically for molecular
translation where the result would be a predicted chemical notation in InChI format
for a given molecular structure. Current approaches mainly follow rule-based or
CNN+RNN based methodology. However, they seem to underperform on noisy
images and images with small number of distinguishable features. To overcome this,
we propose an end-to-end transformer model. When compared to attention-based
techniques, our proposed model outperforms on molecular datasets.


f:id:AI_ML_DL:20210511172043p:plain

f:id:AI_ML_DL:20210511172235p:plain

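The best score in this paper is 6.95, which would be about 231st place on the current leaderboard.

Since the competition is entering its final stage, very high-level information and public code have accumulated on the competition site.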

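May 12 (Wed)

Bristol-Myers Squibb: 654 teams, 22 days to go

Discussion:

It seems a problem had arisen in this competition.

An AI contest on converting molecular structure images to SMILES (Dacon molecular to smiles competition, 2020.09.01 ~ 2020.10.09) was held in Korea, information about the top 3 teams' code was made public, and that information had been shared within the Kaggle competition for a certain period.

SMILES and InChI are different formats, but the material contains information useful for building high-performance prediction models, and apparently some teams were using it to get high scores.

Someone who felt this was unfair posted a summary in the Discussion section, and others joined in, so that anyone can now access the top 3 information. That seems to be what happened.

I confirmed that the top 3 code is stored in a dataset and publicly available. Probably only highly skilled people can make use of it. It looks too difficult for me, but I will give it a try.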

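May 13 (Thu)

Bristol-Myers Squibb: 673 teams, 21 days to go

A model for converting molecular structures to InChI codes:

The molecular structure is given as an image. The image quality is not that of the sharp figures in a new textbook; these look like images blurred by repeated copying. Element symbols such as O, N, P, and S are unclear, single and double bonds are sometimes hard to tell apart, and bonds that express stereochemistry can be unclear too. Speckle noise is also present.

Before estimating the InChI code from the unclear image, create a clear molecular structure model. Besides sharpening the image, make every carbon atom visible, show at each carbon position the number that corresponds to the InChI code, and color-code the atoms to make them easier to identify.

Convert the InChI codes into clear molecular structure models. Then, using pairs of an original image and the clear model generated from its InChI code, build a model that converts original images into clear molecular structure models.

Feeding the clear structures converted from the test images into a model that can convert clear molecular structures into InChI codes should then give accurate InChI codes.

Original image ⇒ Model A ⇒ molecular structure ⇒ Model B ⇒ InChI code.

Can something like this actually be done?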

May 14 (Fri)

Bristol-Myers Squibb: 688 teams, 20 days to go

I plan to follow the Discussion explanations and cited references posted by the GM whose avatar is Kerokerokeroppi.

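May 15 (Sat)

Bristol-Myers Squibb: 697 teams, 19 days to go

Is it possible to draw a structural formula from an InChI code?

Code that uses RDKit to draw a molecular structure from an InChI code has been published.

On the top page of the RDKit documentation, "InChI" does not appear at all, while "SMILES" appears 11 times.

Searching from the search page shows "Search finished, found 34 page(s) matching the search query.", so 34 pages contain InChI.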

rdkit.Chem.inchi module

rdkit.Chem.inchi.MolFromInchi(inchi, sanitize=True, removeHs=True, logLevel=None, treatWarningAsError=False)
Construct a molecule from a InChI string

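This appears to be the function that constructs a molecule from an InChI code (a string).

To draw the constructed molecule, the following module seems to be the one to use.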

rdkit.Chem.Draw.rdMolDraw2D module

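Reading the documentation does not mean I can write the code right away. It looks unusable unless one is at least fluent in both C++ and Python.

There is public code to use as a model, so I will proceed with the necessary work while trying to understand it.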
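As a starting point, here is a minimal sketch (not the public notebook's code) of going from an InChI string to an image file with RDKit. It uses `Chem.MolFromInchi` and the higher-level `Draw.MolToFile` helper rather than `rdMolDraw2D` directly; the function name `inchi_to_png` and the output path are my own choices, and RDKit must be installed.

```python
def inchi_to_png(inchi: str, path: str, size=(300, 300)) -> None:
    """Build a molecule from an InChI string and save a 2D depiction as PNG."""
    # Imported lazily so the function can be defined even without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import Draw

    mol = Chem.MolFromInchi(inchi)
    if mol is None:  # RDKit returns None when the string cannot be parsed
        raise ValueError(f"could not parse InChI: {inchi}")
    Draw.MolToFile(mol, path, size=size)

# Example call (hypothetical file name):
# inchi_to_png("InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1", "example.png")
```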

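Even if molecular structure images clearer than those in the dataset can be obtained from the InChI codes, building a model that converts those images into InChI codes is not easy.

The following paper looks helpful.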

Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI,

Noel M O’Boyle, Journal of Cheminformatics 2012, 4:22

f:id:AI_ML_DL:20210515235510p:plain

Figure 1 An overview of the steps involved in generating Universal and Inchified SMILES. The normalisation step just applies to Inchified SMILES. To simplify the diagram a Standard InChI is shown, but in practice a non-standard InChI (options FixedH and RecMet) is used for Universal SMILES.

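If images showing the atom position numbers, as in this figure, could be generated, I think a model could be built that converts to InChI codes more accurately.

Since this paper was published in 2012, it seems quite likely that the conversion to SMILES codes in the Korean competition drew on a procedure like the one shown here. I suppose this scheme can be used in the Kaggle competition as well.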

May 16 (Sun)

Bristol-Myers Squibb: 703 teams, 18 days to go

Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI, Noel M O’Boyle, Journal of Cheminformatics 2012, 4:22

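Learning from this paper:

The subject is standardizing the SMILES representation on the basis of InChI.

InChI and SMILES both express a molecular structure in a single line. Such notations are suited to storing, representing, and entering structural information into a computer; they should identify the molecule unambiguously, be understandable to both computers and people, and be concise.

As the name International Chemical Identifier suggests, InChI appears to have been promoted by IUPAC (with NIST as the driving body) with the aim of international standardization.

As the name Simplified Molecular Input Line Entry System suggests, SMILES is a simplified line-entry system for molecules, more concise than InChI and easier to grasp intuitively.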

SMILES : C(=O)([O-])C(=O)O

Standard InChI : InChI=1/C2H2O4/c3-1(4)2(5)6/h(H,3,4)(H,5,6)/p-1

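Toward the end of the first page, the paper lists single-line notations other than SMILES and InChI: SLN, ROSDAL, WLN, MCDL, and so on. There is also InChIKey. According to Wikipedia, "The InChIKey, also called the hashed InChI, has a fixed length of 25 characters, but being a digital representation it is not human-readable. The InChIKey specification was released in September 2007 to enable web searches. Unlike InChI itself, the InChIKey is not unique; duplicates occur, though very rarely." (The 25 counts letters only; with its two hyphens an InChIKey is 27 characters long.)

The SMILES used in the Korean competition was, as of 2012, the most popular single-line notation ("The SMILES format is the most popular line notation in use today."). Its problems, however, were the difficulty of expressing stereochemistry and the lack of standardization, and this paper proposes standardizing it on the basis of InChI. The reasons given for why SMILES standardization had not advanced include proposed methods that did not handle stereochemistry, algorithms kept proprietary in commercial products, and free software implementations that were mutually incompatible and never published.

It appears that development of a new single-line notation for molecules was pursued at NIST (National Institute of Standards and Technology) from 1999, and that InChI was proposed as an international standard.

This paper takes the approach of generating a standard SMILES by using the canonical labels from InChI. Code included in various open-source cheminformatics libraries (Open Babel, Chemistry Development Kit, RDKit, Chemkit, Indigo, etc.) can apparently be used.

Page 3 defines two variants, Inchified SMILES and Universal SMILES.

The paper's explanation uses the Open Babel chemoinformatics toolkit. The public Kaggle code uses RDKit.

Methods: are drawing the figure, displaying the atom position numbers, changing those numbers, and so on carried out in the following four steps? It seems not.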

structure normalization

canonical labeling

graph traversal

SMILES generation

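The drawing in Fig. 1 just illustrates the molecular structure to explain the InChI-to-SMILES conversion process, with the atom position numbers shown as an aid to understanding. The actual conversion appears to be done in an instant by Open Babel's obabel command-line program.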

The following commands show how to use the obabel command-line program to generate Universal and Inchified SMILES strings for a structure stored in a Mol file:
C:\>obabel figure1.mol -osmi -xU
c1cc(/C=C/F)cc(c1)[N+](=O)[O-]
C:\>obabel figure1.mol -osmi -xI
c1cc(/C=C/F)cc(c1)N(=O)=O

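Tomorrow I will learn (from the public code) how to display a molecular structure from an InChI code, and build a dataset of the resulting structure images.

May 17 (Mon)

Bristol-Myers Squibb: 715 teams, 17 days to go

Let me read a little further into the paper.

Reading the paper alone isn't making things clear, so let me try Open Babel.

I installed OpenBabel 2.4.0.

The OpenBabelGUI window appears. From left to right it shows INPUT FORMAT, CONVERT, and OUTPUT FORMAT. Let me try to reproduce the following command in OpenBabelGUI.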

C:\>obabel figure1.mol -osmi -xU
c1cc(/C=C/F)cc(c1)[N+](=O)[O-]

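First, set the format of the input pane on the left to InChI and the format of the output pane on the right to SMILES.

Next, enter the InChI code shown in Fig. 1 of the paper into the left pane.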

InChI=1S/C8H6FNO2/c9-5-4-7-2-1-3-8(6-7)10(11)12/h1-6H/b5-4+

Clicking the CONVERT button in the center output the SMILES code.

c1cc(/C=C/F)cc(c1)N(=O)=O

Next, keeping the InChI code on the left, set the output format to "png -- PNG 2D depiction" and specify an output file name.

Clicking the CONVERT button in this state saved the following image to the specified file.

f:id:AI_ML_DL:20210517112650p:plain

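In that case, I thought, let me paste an InChI code from the competition dataset into the left pane, change the output file name, and click CONVERT. The molecule was output as follows.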

InChI=1S/C13H15NO2/c1-10(14-11(2)16)13-7-5-12(6-8-13)4-3-9-15/h5-8,10,15H,9H2,1-2H3,(H,14,16)

f:id:AI_ML_DL:20210517113405p:plain

The parameters for generating the depiction can be set in detail.

Carbon atoms can also be shown explicitly.

InChI=1S/C38H40N4O6S2/c43-35-25-47-31-17-5-1-13-27(31)37(45)39-21-9-10-22-40-38(46)28-14-2-6-18-32(28)48-26-36(44)42-30-16-4-8-20-34(30)50-24-12-11-23-49-33-19-7-3-15-29(33)41-35/h1-8,13-20H,9-12,21-26H2,(H,39,45)(H,40,46)(H,41,43)(H,42,44)

f:id:AI_ML_DL:20210517121736p:plain

So OpenBabelGUI can do the same thing as RDKit: obtain a molecular structure from an InChI code. For now, however, I can only process InChI codes one at a time.
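For processing many InChI codes in one go, a loop over the label file with RDKit is one option. The sketch below is only an illustration: it assumes a CSV with `image_id` and `InChI` columns (which is how I recall the competition's train_labels.csv being laid out), and the function name `render_all` and the `limit` parameter are hypothetical.

```python
import csv
from pathlib import Path

def render_all(labels_csv: str, out_dir: str, limit: int = 100) -> int:
    """Render up to `limit` InChI labels from a CSV file to PNG files; return the count."""
    from rdkit import Chem            # lazy import: RDKit is needed only when called
    from rdkit.Chem import Draw

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(labels_csv, newline="") as f:
        for row in csv.DictReader(f):
            if written >= limit:
                break
            mol = Chem.MolFromInchi(row["InChI"])   # assumed column name
            if mol is None:                          # skip labels RDKit cannot parse
                continue
            Draw.MolToFile(mol, str(out / f"{row['image_id']}.png"), size=(300, 300))
            written += 1
    return written
```

The `limit` guard is there because the training set is large; rendering everything would be better done in parallel batches.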

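InChI codes can be converted into clean molecular structures, and carbon atoms can be displayed, but I have not managed to display the atom position numbers.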
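RDKit can annotate a drawing with atom indices, which may be a way to get numbered positions. A hedged sketch follows: it relies on the `addAtomIndices` draw option (available in reasonably recent RDKit releases), and the indices shown are RDKit's own 0-based atom indices, which I have not verified to coincide with the InChI canonical numbers. The function name is hypothetical.

```python
def inchi_to_svg_with_indices(inchi: str, width: int = 400, height: int = 300) -> str:
    """Return an SVG depiction of the molecule with RDKit atom indices drawn next to atoms."""
    from rdkit import Chem                      # lazy import: requires RDKit
    from rdkit.Chem.Draw import rdMolDraw2D

    mol = Chem.MolFromInchi(inchi)
    if mol is None:
        raise ValueError(f"could not parse InChI: {inchi}")
    drawer = rdMolDraw2D.MolDraw2DSVG(width, height)
    drawer.drawOptions().addAtomIndices = True  # RDKit indices, not InChI canonical numbers
    rdMolDraw2D.PrepareAndDrawMolecule(drawer, mol)
    drawer.FinishDrawing()
    return drawer.GetDrawingText()
```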

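For now, assume the structure drawings in train_data have been converted into clear images using their InChI labels, and let me check whether what I originally had in mind is feasible.

How to make the test_data images clear:

This may take me far from the competition, but for improving image quality, GANs or Neural Style Transfer look usable. I have run such code before but don't think I understand its essence, so I will read the text Generative Deep Learning.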

Generative Deep Learning

by David Foster

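The competition may be over before I finish this book! Let me look for any technique I can use. Some useful information may be hiding in it. It doesn't have to be usable right away; it is enough if the next step comes into view...

With just a little thought, what I realized is that this looks like deliberately using deep learning for something that can be answered 100% correctly. The task of this competition essentially belongs to the world of determinism, not probability. It is like training a CNN + RNN/LSTM on pairs of equation images and numerical answers, without teaching the rules of arithmetic, and having it estimate the correct label from an equation image. Something similar applies to the attempt to have an AI solve the University of Tokyo entrance exam, which did not quite succeed. Because the problem is given as an image, even a deterministic problem turns into a probabilistic one, or the intent of the problem can be grasped only with some probability, and that may be why the score could not get past a certain level. Considering how to give the system the knowledge needed to reach an examinee's level, one would need datasets that provide the ability to solve each problem, or each element of a problem, and perhaps there was not enough effort available to prepare such enormous datasets.

Enabling neural networks to solve all kinds of problems, not just University of Tokyo entrance exam questions or entrance exam questions in general, may be very important. It might even lead to self-learning. Becoming able to solve problems means acquiring pseudo-knowledge from image information (the exam questions), and this could develop from the level of arithmetic to mathematics, from natural sciences such as physics, chemistry, and biology to logic, economics, business administration, law, and even philosophical thinking. How do humans acquire knowledge? How do humans acquire the ability to think? What does thinking consist of? Isn't the content of a conversation merely the output of stored information?

I want to think harder about the learning methods of machine learning, but I feel I don't sufficiently understand the foundations to think from. What functions are needed to evolve from a learning machine to a thinking machine, and further to a creative machine and a research machine? And what functions would be needed for these to self-replicate?

Enough daydreaming (delusion); back to the text.

That's all for today...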

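May 18 (Tue)

Bristol-Myers Squibb: 727 teams, 16 days to go

Generative Deep Learning by David Foster, continued:

Generative and Discriminative: my impression is that generative means making something similar and discriminative means telling things apart. Being similar means being hard to tell apart, so they may look like two views of the same thing, but the biggest difference between the generative modeling process and the discriminative modeling process is the output. A generative model outputs images, sound, or text. A discriminative model outputs a label or a score. The output in this competition is a label.

Generative modeling: what matters is clarifying how the data are generated. As with the guiding principle of reinforcement learning, it matters in searching for the means and paths to achieve a goal. The ultimate form of generative modeling would be to understand and reproduce what is happening in the brain of the person reading this book right now.

Reinforcement learning: it became famous through programs that beat humans at Atari games, and went on to beat humans at chess, shogi, and Go. Since then it has apparently also been applied to autonomous driving, drug discovery support, protein structure analysis, and so on. I am trying to learn it because it is an important technology, but I don't really grasp its essence.

First, let me look over Chapter 8, Play, of the text.

After an explanation of Reinforcement Learning, it moves on to a worked example, and the example is a game, CarRacing. That instantly kills my motivation. What to do?

Among the many competitions I joined last September was Halite by Two Sigma, "Collect the most halite during your match in space". It looked like a competition for trying out Reinforcement Learning, so I thought I would learn Reinforcement Learning there. But I was in several other competitions at the same time, had far too little time to spare for each, and ended up with both my RL study and my score half-finished. Remembering this, I went back to the site and skimmed the write-up by the top team (an individual), and was surprised. That person, it turns out, had been at DeepMind, the company known for advancing deep Reinforcement Learning, and evidently knew the field thoroughly. Naturally, their programming skill must also be immense. They judged RL insufficient for aiming at the top and competed with conventional programming instead: reportedly 11,000 lines. It is not that Reinforcement Learning itself is no good; apparently the training time was insufficient. What I found truly impossible to imitate was that they observed the matches, read the opponents' tactics, devised tactics to surpass them, and programmed those.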

DEEP REINFORCEMENT LEARNING
Yuxi Li (yuxili@gmail.com), arXiv:1810.06339v1 [cs.LG] 15 Oct 2018

ABSTRACT
We discuss deep reinforcement learning in an overview style. We draw a big picture, filled with details. We discuss six core elements, six important mechanisms, and twelve applications, focusing on contemporary work, and in historical contexts. We start with background of artificial intelligence, machine learning, deep learning, and reinforcement learning (RL), with resources. Next we discuss RL core elements, including value function, policy, reward, model, exploration vs. exploitation, and representation. Then we discuss important mechanisms for RL, including attention and memory, unsupervised learning, hierarchical RL, multiagent RL, relational RL, and learning to learn. After that, we discuss RL applications, including games, robotics, natural language processing (NLP), computer
vision, finance, business management, healthcare, education, energy, transportation, computer systems, and, science, engineering, and art. Finally we summarize briefly, discuss challenges and opportunities, and close with an epilogue.

150 pages in total, full of equations; I can't make sense of it at all.

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
Sergey Levine, Aviral Kumar, George Tucker, Justin Fu arXiv:2005.01643v3 [cs.LG] 1 Nov 2020

Abstract
In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection. Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision making engines. Effective offline reinforcement learning methods would be able to extract policies with the maximum possible utility out of the available data, thereby allowing automation of a wide range of decision-making domains, from healthcare and education to robotics. However, the limitations of current algorithms make this difficult. We will aim to provide the reader with an understanding of these challenges, particularly in the context of modern deep reinforcement learning methods, and describe some potential solutions that have been explored in recent work to mitigate these challenges, along with recent applications, and a discussion of perspectives on open problems in the field.

34 pages in total, full of equations; I can't make sense of it at all.

The next paper is the basis of the main storyline of the Reinforcement Learning chapter, Chapter 8, of the text introduced earlier, Generative Deep Learning by David Foster. It is very carefully written, and I consider it a must-read.

World Models
David Ha and Jurgen Schmidhuber, arXiv:1803.10122v4 [cs.LG] 9 May 2018

Abstract
We explore building generative neural network models of popular reinforcement learning
environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer
this policy back into the actual environment.


Humans develop a mental model of the world based on what they are able to perceive with their limited senses. The decisions and actions we make are based on this internal
model. Jay Wright Forrester, the father of system dynamics, described a mental model as:
The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system. (Forrester, 1971)


To handle the vast amount of information that flows through our daily lives, our brain learns an abstract representation of both spatial and temporal aspects of this information. We are able to observe a scene and remember an abstract description thereof. Evidence also suggests that what we perceive at any given moment is governed by our brain’s prediction of the future based on our internal model.


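May 19 (Wed)

Bristol-Myers Squibb: 746 teams, 15 days to go

I mean to record how I work on this competition day by day, but I have wandered onto a side road, and today I will keep going down it.

Reinforcement Learning:

I have tried several times to learn it, without success. Yesterday too I looked through several papers, the Kaggle course, and the textbooks by F. Chollet, A. Géron, and D. Foster, but nothing sank in. What I have vaguely come to realize is that Reinforcement Learning has old origins, and that the basic explanations may be omitted on the assumption that the fundamentals are already known.

So I searched Google Scholar for Reinforcement Learning and decided to read the following paper. It is 25 years old.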

Reinforcement Learning: A Survey

Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore,

Journal of Artificial Intelligence Research 4 (1996) 237-285

Abstract
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.


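Having looked through about a third of the 30 pages of main text, my impression is that, perhaps because no killer application had been found yet, the discussion is theoretical and conceptual, lacks concreteness, and seems to go round and round the same spot. Learning from the past is fine, but this is a hard world to get into.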
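To give one of the survey's central issues, the exploration versus exploitation trade-off, a concrete shape, here is a minimal epsilon-greedy sketch on a multi-armed bandit. It is my own toy illustration, not code from the paper; the arm means, noise level, and parameter values are arbitrary assumptions.

```python
import random

def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a Gaussian bandit; returns estimates, pull counts, mean reward."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    estimates = [0.0] * n
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                            # explore: random arm
        else:
            arm = max(range(n), key=lambda a: estimates[a])   # exploit: best estimate so far
        reward = rng.gauss(true_means[arm], 1.0)              # noisy reward from the chosen arm
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean update
        total += reward
    return estimates, counts, total / steps

estimates, counts, avg_reward = run_bandit([0.0, 0.5, 2.0])
print(counts, round(avg_reward, 2))
```

With epsilon set to 0 the agent can lock onto whichever arm looked good first; with epsilon set to 1 it never uses what it has learned. The small in-between value is the trade-off.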

Let me look into how Reinforcement Learning is used in autonomous driving. That is far more motivating than the combination of Reinforcement Learning and games.

Deep Reinforcement Learning for Autonomous Driving: A Survey
B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Yogamani, and Patrick Pérez, arXiv:2002.00444v2 [cs.LG] 23 Jan 2021

Abstract—With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of
simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
Index Terms—Deep reinforcement learning, Autonomous driving, Imitation learning, Inverse reinforcement learning, Controller learning, Trajectory optimisation, Motion planning, Safe reinforcement learning.


The main contributions of this work can be summarized as follows:
・Self-contained overview of RL background for the automotive community as it is not well known.
・Detailed literature review of using RL for different autonomous driving tasks.
・Discussion of the key challenges and opportunities for RL applied to real world autonomous driving.
The rest of the paper is organized as follows.

Section II provides an overview of components of a typical autonomous driving system.

Section III provides an introduction to reinforcement learning and briefly discusses key concepts.

Section IV discusses more sophisticated extensions on top of the basic RL framework.

Section V provides an overview of RL applications for autonomous driving problems.

Section VI discusses challenges in deploying RL for real-world autonomous driving systems.

Section VII concludes this paper with some final remarks.

f:id:AI_ML_DL:20210519144158p:plain

f:id:AI_ML_DL:20210519220609p:plain

The caption of this table is: OPEN-SOURCE FRAMEWORKS AND PACKAGES FOR STATE OF THE ART RL/DRL ALGORITHMS AND EVALUATION.

Let me look at the DeepMind "bsuite" paper listed at the end of this table.

Behaviour Suite for Reinforcement Learning
Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, and Hado Van Hasselt, arXiv:1908.03568v3 [cs.LG] 14 Feb 2020
Abstract
This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives. First, to collect clear, informative and scalable problems that capture key issues in the design of general and efficient learning algorithms. Second, to study agent behaviour through their performance on these shared benchmarks. To complement this effort, we open source
github.com/deepmind/bsuite, which automates evaluation and analysis of any agent on bsuite. This library facilitates reproducible and accessible research on the core issues in RL, and ultimately the design of superior learning algorithms. Our code is Python, and easy to use within existing projects. We include examples with OpenAI Baselines, Dopamine as well as new reference implementations. Going forward, we hope to incorporate more excellent experiments from the research community, and commit to a
periodic review of bsuite from a committee of prominent researchers.

Interest in artificial intelligence has undergone a resurgence in recent years. Part of this
interest is driven by the constant stream of innovation and success on high profile challenges previously deemed impossible for computer systems. Improvements in image recognition are a clear example of these accomplishments, progressing from individual digit recognition (LeCun et al., 1998), to mastering ImageNet in only a few years (Deng et al., 2009; Krizhevsky et al., 2012). The advances in RL systems have been similarly impressive: from checkers (Samuel, 1959), to Backgammon (Tesauro, 1995), to Atari games (Mnih et al., 2015a), to competing with professional players at DOTA (Pachocki et al., 2019) or StarCraft (Vinyals et al., 2019) and beating world champions at Go (Silver et al., 2016). Outside of playing games, decision systems are increasingly guided by AI systems (Evans & Gao, 2016).


As we look towards the next great challenges for RL and AI, we need to understand our systems better (Henderson et al., 2017). This includes the scalability of our RL algorithms, the environments where we expect them to perform well, and the key issues outstanding in the design of a general intelligence system. We have the existence proof that a single self-learning RL agent can master the game of Go purely from self-play (Silver et al., 2018). We do not have a clear picture of whether such a learning algorithm will perform well at driving a car, or managing a power plant. If we want to take the next leaps forward, we need to continue to enhance our understanding.


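This is a paper by researchers at DeepMind, the company that revived Reinforcement Learning. DeepMind's founder was apparently a master game player; the company rediscovered Reinforcement Learning, conquered computer games and board games, and seems to be developing Reinforcement Learning further and widening its range of applications.

RL has surpassed humans at Go, shogi, and chess. For driving a car or managing a power plant, the paper says there is still no clear picture of whether such an algorithm will perform well. The latter tasks need real-time information from sensors and instruments that the former do not; beyond that, is the basic approach the same?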

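May 20 (Thu)

Bristol-Myers Squibb: 749 teams, 14 days to go

Being able to convert blurry molecular structure images to InChI codes with over 99% accuracy is so impressive that it looks as if the code conversion rules were simply being applied correctly.

There is no room for me here. I will look forward to the solutions.

Reinforcement Learning:

Continuing on, let me learn from "Behaviour Suite for Reinforcement Learning" by DeepMind.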

1.1 Practical theory often lags practical algorithms:

The current theory of deep RL is still in its infancy. In the absence of a comprehensive theory, the community needs principled benchmarks that help to develop an understanding of the strengths and weaknesses of our algorithms.

1.2 An ‘MNIST’ for reinforcement learning

Just as the MNIST dataset offers a clean, sanitised, test of image recognition as a stepping stone to advanced computer vision; so too bsuite aims to instantiate targeted experiments for the development of key RL capabilities.

1.3 Open source code, reproducible research

As part of this project we open source github.com/deepmind/bsuite, which instantiates all experiments in code and automates the evaluation and analysis of any RL agent on bsuite. This library serves to facilitate reproducible and accessible research on the core issues in reinforcement learning.

1.4 Related work

2 Experiments

2.1 Example experiment: memory length

f:id:AI_ML_DL:20210520162152p:plain

f:id:AI_ML_DL:20210520162248p:plain

I tried to read and understand this, but I have no idea what the task is or what the following terms mean (DQN I have at least seen before).
actor-critic with a recurrent neural network

feed-forward DQN

Bootstrapped DQN

A2C

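Chapter 18, Reinforcement Learning, of the second edition of A. Géron's textbook explains these terms, so next I will go through that chapter. It is 57 pages long.

Because it starts with games, I gave up every time I tried, but now there is no turning back. Unless I get through this, I won't survive in the world of artificial intelligence.

It says the field began around 1950. About when I was born. Oops, I've given away my age.

It drew attention in 2013, with DeepMind's work on Atari: given no information about the rules of the games, only the screen, the system achieved scores better than humans.

DeepMind was acquired by Google for $500 million.

The key technical elements of what is called Deep Reinforcement Learning are policy gradients, deep Q-networks, and Markov decision processes.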
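Deep Q-networks replace a table of Q-values with a neural network, so the tabular version is a reasonable first thing to look at. The sketch below is my own toy example (not from the book): Q-learning on a five-state corridor, which is a tiny Markov decision process. All names and parameter values are assumptions chosen for illustration.

```python
import random

N_STATES, GOAL = 5, 4      # states 0..4 in a row; reaching state 4 ends the episode
ACTIONS = (-1, +1)         # action 0 = move left, action 1 = move right

def step(state, action):
    """Deterministic transition: reward 1.0 on reaching the goal, otherwise 0."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)                                          # explore
            else:
                best = max(q[state])
                a = rng.choice([i for i in (0, 1) if q[state][i] == best])    # greedy, random tie-break
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning target: immediate reward plus discounted best value of the next state.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q

q = q_learning()
policy = ["left" if row[0] > row[1] else "right" for row in q[:GOAL]]
print(policy)
```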

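These are applied to balancing a pole on a moving cart.
The TensorFlow-Agents library is introduced.

Using this library, an agent is trained to play Breakout, the famous Atari game.

So it comes down to games after all, I can't help thinking, but I accept that there is no way forward without getting through this, and I will work on it. Before the Bristol-Myers Squibb competition ends.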

Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by A. Géron, Second Edition, September 2019

Chapter 18 Reinforcement Learning

Reinforcement Learning (RL) is one of the most exciting fields of Machine Learning today, and also one of the oldest.

DeepMind applied the power of Deep Learning to the field of Reinforcement Learning, and it worked beyond their wildest dreams.

In this chapter we will first explain what Reinforcement Learning is and what it's good at, then present two of the most important techniques in Deep Reinforcement Learning: policy gradients and deep Q-networks (DQNs), including a discussion of Markov decision processes (MDPs).

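May 21 (Fri)

Bristol-Myers Squibb: 756 teams, 13 days to go

Dormant: a score of 1.0 is still only 14th place. Amazing.

I looked at code that cleans up the molecular structure images. It removes dot-like noise, fills in missing dots along straight lines, crops away unneeded margins, and so on. But on top of that, how would one use reinforcement learning for tasks such as adding carbon atoms or adding the position information of the carbons?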
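Two of the clean-up steps mentioned here, removing isolated noise dots and cropping empty margins, can be sketched in plain Python on a binary image (1 = ink, 0 = background). This is my own simplified illustration, not the code I looked at; real images would normally be handled with NumPy or OpenCV.

```python
def remove_isolated_pixels(img):
    """Clear ink pixels that have no ink neighbour in their 8-neighbourhood."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            neighbours = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:      # a lone dot is treated as noise
                out[y][x] = 0
    return out

def crop_to_content(img):
    """Cut away empty margins around the remaining ink pixels."""
    rows = [y for y, row in enumerate(img) if any(row)]
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    if not rows:
        return img
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]

noisy = [
    [1, 0, 0, 0, 0, 0],   # lone dot in the corner
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],   # a short horizontal bond line
    [0, 0, 0, 0, 0, 0],
]
clean = crop_to_content(remove_isolated_pixels(noisy))
print(clean)
```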

Learning to Optimize Rewards

An agent, within an environment, makes observations to grasp the situation and takes actions, and thereby obtains rewards.
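The observation, action, reward cycle can be written as a short loop. The environment below is a made-up toy (a corridor with a goal at one end), and the `reset`/`step` method names merely imitate a common convention; nothing here comes from the book's code.

```python
import random

class CorridorEnv:
    """Toy environment: the agent walks on positions 0..4 and is rewarded at position 4."""

    def reset(self):
        self.pos = 0
        return self.pos                          # initial observation

    def step(self, action):                      # action: -1 (left) or +1 (right)
        self.pos = min(max(self.pos + action, 0), 4)
        done = self.pos == 4
        reward = 1.0 if done else -0.1           # small cost per move, bonus at the goal
        return self.pos, reward, done

def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward collected by the agent."""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        action = policy(obs)                     # the agent acts on its observation
        obs, reward, done = env.step(action)     # the environment responds
        total += reward
        if done:
            break
    return total

rng = random.Random(0)
print(run_episode(CorridorEnv(), lambda obs: rng.choice((-1, 1))))  # random agent
print(run_episode(CorridorEnv(), lambda obs: 1))                    # agent that always moves right
```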

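Example tasks

Robot: the agent is the program controlling the robot; the environment is the robot's range of activity; observations are made through camera images and touch-sensor signals; actions are the signals sent to move or operate the robot; rewards reflect the degree of achievement, such as reaching the destination, performing a prescribed motion, or the time taken.

Pac-Man: the agent is the program controlling Pac-Man; the environment is a simulation of the Atari game (at first I didn't see what this meant, then I got it: to train an agent that wins the game, the simulation has to behave the same as the actual game. In any task, training requires creating the actual environment virtually. Does this play a role similar to the data-label set in supervised learning? For training an autonomous-driving agent, it means having to actually drive on roads and collect images, radar signals, and so on.); actions are joystick positions; observations are screenshots; rewards are the game points.

Go: same as Pac-Man, with the rewards being the territory occupied.

Thermostat: I don't see what introducing reinforcement learning is meant to accomplish here. The closer the temperature of the target is to the set temperature, the higher the reward. The target temperature is detected by a sensor. Are a heater and a cooler operated alternately, or is only heating or only cooling operated to approach the set temperature? The keyword turned out to be "smart thermostat".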
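The thermostat case can be made concrete with a reward that grows as the temperature approaches the setpoint. The sketch below uses a deliberately crude room model with made-up constants and a fixed on/off rule as a baseline; an RL agent would replace that rule with a learned policy. All names and numbers are my own assumptions.

```python
def thermostat_reward(temperature: float, setpoint: float) -> float:
    """Reward is 0 at the setpoint and becomes more negative the further away we are."""
    return -abs(temperature - setpoint)

def simulate(policy, setpoint=22.0, start=15.0, outside=10.0, steps=50):
    """Crude room model: the heater adds heat, and the room leaks toward the outside temperature."""
    temp, total = start, 0.0
    for _ in range(steps):
        heater_on = policy(temp, setpoint)                          # action
        temp += (1.0 if heater_on else 0.0) - 0.05 * (temp - outside)
        total += thermostat_reward(temp, setpoint)                  # reward for this step
    return temp, total

def on_off(temp, setpoint):
    """Baseline rule: heat whenever the room is below the setpoint."""
    return temp < setpoint

final_temp, total_reward = simulate(on_off)
print(round(final_temp, 1), round(total_reward, 1))
```

A smart thermostat would add things this toy leaves out, such as occupancy and energy cost in the reward.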

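Smart thermostat: quoting the seemingly relevant passage from Wikipedia:

Programmable and automatic schedules
The programmable schedule feature of smart thermostats is similar to that of standard programmable thermostats. Users are given the option to program a custom schedule to reduce energy use while they are away from home. Studies have shown, however, that creating schedules manually can lead to more energy use than keeping the thermostat at a set temperature. To avoid this problem, smart thermostats also offer an automatic schedule feature, which requires the use of algorithms and pattern recognition to create a schedule that delivers occupant comfort and energy savings. Once a schedule has been created, the thermostat continues to monitor occupant behavior and adjusts the automatic schedule. By removing human error from scheduling, smart thermostats can create smart schedules that actually save energy.

So the agent is the program that controls the thermostat's operating conditions, I suppose.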

Other examples include stock trading, recommender systems, placing ads on a web page, and controlling where an image classification system should focus its attention.

What kinds of tasks are well suited to Reinforcement Learning? It seems I need to survey what application areas exist.

Let me search Google Scholar with suitable keywords: Reinforcement Learning, suited tasks, applied areas, ...

Reinforcement learning for personalization: A systematic literature review

Floris den Hengst et al., Data Science 3 (2020) 107–147

Abstract.

The major application areas of reinforcement learning (RL) have traditionally been game playing and continuous control. In recent years, however, RL has been increasingly applied in systems that interact with humans. RL can personalize digital systems to make them more relevant to individual users. Challenges in personalization settings may be different from challenges found in traditional application areas of RL. An overview of work that uses RL for personalization, however, is lacking. In this work, we introduce a framework of personalization settings and use it in a systematic literature review. Besides setting, we review solutions and evaluation strategies. Results show that RL has been increasingly applied to personalization problems and realistic evaluations have become more prevalent. RL has become sufficiently robust to apply in contexts that involve humans and the field as a whole is growing. However, it seems not to be maturing: the ratios of studies that include a comparison or a realistic evaluation are not showing upward trends and the vast majority of algorithms are used only once. This review can be used to find related work across domains, provides insights into the state of the field and identifies opportunities for future work.
強化学習（RL）の主な応用分野は、伝統的にゲームプレイと継続的な制御でした。しかし、近年、RLは人間と相互作用するシステムにますます適用されています。 RLは、デジタルシステムをパーソナライズして、個々のユーザーとの関連性を高めることができます。パーソナライズ設定の課題は、RLの従来のアプリケーション分野で見られる課題とは異なる場合があります。ただし、パーソナライズにRLを使用する作業の概要は不足しています。この作業では、パーソナライズ設定のフレームワークを紹介し、系統的文献レビューで使用します。設定に加えて、ソリューションと評価戦略を確認します。結果は、RLがパーソナライズの問題にますます適用され、現実的な評価がより一般的になっていることを示しています。 RLは、人間が関与するコンテキストに適用するのに十分な堅牢性を備えており、フィールド全体が成長しています。ただし、成熟していないようです。比較または現実的な評価を含む研究の比率は上昇傾向を示しておらず、アルゴリズムの大部分は1回しか使用されていません。このレビューは、ドメイン間で関連する作業を見つけるために使用でき、フィールドの状態への洞察を提供し、将来の作業の機会を特定します。by Google翻訳

The Societal Implications of Deep Reinforcement Learning

Jess Whittlestone et al., Journal of Artificial Intelligence Research 70 (2021) 1003–1030

Abstract
Deep Reinforcement Learning (DRL) is an avenue of research in Artificial Intelligence (AI) that has received increasing attention within the research community in recent years, and is beginning to show potential for real-world application. DRL is one of the most promising routes towards developing more autonomous AI systems that interact with and take actions in complex real-world environments, and can more flexibly solve a range of problems for which we may not be able to precisely specify a correct ‘answer’. This could have substantial implications for people’s lives: for example by speeding up automation in various sectors, changing the nature and potential harms of online influence, or introducing new safety risks in physical infrastructure. In this paper, we review recent progress in DRL, discuss how this may introduce novel and pressing issues for society, ethics, and governance, and highlight important avenues for future research to better understand DRL’s societal implications.

Deep Reinforcement Learning（DRL）は、近年研究コミュニティ内でますます注目を集めている人工知能（AI）の研究手段であり、実際のアプリケーションの可能性を示し始めています。 DRLは、複雑な実世界の環境と相互作用してアクションを実行する、より自律的なAIシステムを開発するための最も有望なルートのひとつであり、正しい答えを正確に特定できない可能性のあるさまざまな問題をより柔軟に解決できます。これは、人々の生活に大きな影響を与える可能性があります。たとえば、さまざまなセクターでの自動化の高速化、オンラインの影響の性質と潜在的な害の変化、物理インフラストラクチャへの新しい安全上のリスクの導入などです。このホワイトペーパーでは、DRLの最近の進歩を確認し、これが社会、倫理、ガバナンスに斬新で差し迫った問題をどのようにもたらすかについて説明し、DRLの社会的影響をよりよく理解するための将来の研究のための重要な手段を強調します。by Google翻訳

May 22 (Sat)

Bristol-Myers Squibb: 764 teams, 12 days to go

The top-50 scores are being updated every day. Amazing. I would like to be among them.

Deep Reinforcement Learning:

Reinforcement learning for personalization: A systematic literature review

I still do not understand what this paper is about.

Wikipedia : Personalization (broadly known as customization) consists of tailoring a service or a product to accommodate specific individuals, sometimes tied to groups or segments of individuals. A wide variety of organizations use personalization to improve customer satisfaction, digital sales conversion, marketing results, branding, and improved website metrics as well as for advertising. Personalization is a key element in social media and recommender systems.

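systematic literature review (SLR): I assumed this paper reviews the literature on reinforcement learning for personalization, but could it instead be that the authors applied RL to the SLR itself and surveyed the papers that apply RL to personalization?

Let's look at the other paper.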

The Societal Implications of Deep Reinforcement Learning

1. Introduction

Our discussion aims to provide important context and a clear starting point for the AI ethics and governance community to begin considering the societal implications of DRL in more depth.

私たちの議論は、AI倫理およびガバナンスコミュニティがDRLの社会的影響をより深く検討し始めるための重要なコンテキストと明確な出発点を提供することを目的としています。by Google翻訳

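This paper also seems to miss the mark.

2. Deep Reinforcement Learning: a Brief Overview

It is indeed a brief overview. Nothing in it resonated with me. Let's say the fault lies with the reader.

I will finish by pasting a translation of the last paragraph before the Acknowledgements.

DRL is likely to play an important role in the future of AI, promising increasingly autonomous and flexible systems with potential for application in increasingly high-stakes domains. As a result, DRL raises and exacerbates many of the concerns that shape discussions about the safe and responsible use of AI, in ways that may be overlooked by more general treatments of AI's impacts. Because the most pressing challenges for society will depend on the precise nature of progress in AI research, it is important to build and maintain strong collaboration between the groups working on AI progress and on AI governance, in order to prepare for the challenges of future AI systems.

(rendered from the Google Translate version)

Today's entry turned out thin again.

Back to A. Geron's text.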

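Policy Search

The algorithm a software agent uses to determine its actions is called its policy. The policy could be a neural network taking observations as inputs and outputting the action to take (see Figure 18-2).

Figure 18-2. Reinforcement Learning using a neural network policy: I had looked at this figure many times, but today I think I finally understood what the diagram means.

An Agent box is drawn on the left and an Environment box on the right. The arrow from the Agent box to the Environment box is Actions, and the arrows from Environment to Agent are Rewards and Observations. It is a familiar diagram, but with a difference: inside the Agent box there is a humanoid robot with a cloud-shaped thought bubble rising from its head, and inside the bubble is a neural network whose input nodes are connected to Rewards and Observations and whose output nodes are connected to Actions.

After this, the explanation of the policy continues for about a page, with figures. The agent is a robotic vacuum cleaner, and terms such as stochastic policy, policy space, and genetic algorithms are briefly explained. The text then mentions optimization using the gradients of the rewards with respect to the policy parameters, says it will implement a well-known policy gradient (PG) algorithm in TensorFlow, and introduces OpenAI Gym, which is needed to create the environment the agent lives in.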

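May 23 (Sun)

Bristol-Myers Squibb: 772 teams, 11 days to go

The top score is now 0.60 and the leading team has changed. 15 teams are at 1.00 or below.

Deep Reinforcement Learning in physics and chemistry

While looking for applications to physics and chemistry, the first thing that caught my eye was the following paper.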

Structure prediction of surface reconstructions by deep reinforcement learning
Søren A Meldgaard, Henrik L Mortensen, Mathias S Jørgensen and Bjørk Hammer

Published 8 July 2020 • © 2020 IOP Publishing Ltd
Journal of Physics: Condensed Matter, Volume 32, Number 40

Abstract
We demonstrate how image recognition and reinforcement learning combined may be used to determine the atomistic structure of reconstructed crystalline surfaces. A deep neural network represents a reinforcement learning agent that obtains training rewards by interacting with an environment. The environment contains a quantum mechanical potential energy evaluator in the form of a density functional theory program. The agent handles the 3D atomistic structure as a series of stacked 2D images and outputs the next atom type to place and the atomic site to occupy. Agents are seen to require 1000–10 000 single point density functional theory evaluations, to learn by themselves how to build the optimal surface reconstructions of anatase TiO2(001)-(1 × 4) and rutile SnO2(110)-(4 × 1).

Gaussian representation for image recognition and reinforcement learning of atomistic structure

Mads-Peter V Christiansen, Henrik Lund Mortensen, Søren Ager Meldgaard, and Bjørk Hammer, J Chem Phys. 2020 Jul 28;153(4):044107

Abstract
The success of applying machine learning to speed up structure search and improve property prediction in computational chemical physics depends critically on the representation chosen for the atomistic structure. In this work, we investigate how different image representations of two planar atomistic structures (ideal graphene and graphene with a grain boundary region) influence the ability of a reinforcement learning algorithm [the Atomistic Structure Learning Algorithm (ASLA)] to identify the structures from no prior knowledge while interacting with an electronic structure program. Compared to a one-hot encoding, we find a radial Gaussian broadening of the atomic position to be beneficial for the reinforcement learning process, which may even identify the Gaussians with the most favorable broadening hyperparameters during the structural search. Providing further image representations with angular information inspired by the smooth overlap of atomic positions method, however, is not found to cause further speedup of ASLA.

Predictive Synthesis of Quantum Materials by Probabilistic Reinforcement Learning
Pankaj Rajak, Aravind Krishnamoorthy, Ankit Mishra, Rajiv Kalia, Aiichiro Nakano and Priya Vashishta, arXiv.org > cond-mat > arXiv:2009.06739v1, [Submitted on 14 Sep 2020]

Abstract
Predictive materials synthesis is the primary bottleneck in realizing new functional and quantum materials. Strategies for synthesis of promising materials are currently identified by time consuming trial and error approaches and there are no known predictive schemes to design synthesis parameters for new materials. We use reinforcement learning to predict optimal synthesis schedules, i.e. a time-sequence of reaction conditions like temperatures and reactant concentrations, for the synthesis of a prototypical quantum material, semiconducting monolayer MoS2, using chemical vapor deposition. The predictive reinforcement leaning agent is coupled to a deep generative model to capture the crystallinity and phase-composition of synthesized MoS2 during CVD synthesis as a function of time-dependent synthesis conditions. This model, trained on 10000 computational synthesis simulations, successfully learned threshold temperatures and chemical potentials for the onset of chemical reactions and predicted new synthesis schedules for producing well-sulfidized crystalline and phase-pure MoS2, which were validated by computational synthesis simulations. The model can be extended to predict profiles for synthesis of complex structures including multi-phase heterostructures and can also predict long-time behavior of reacting systems, far beyond the domain of the MD simulations used to train the model, making these predictions directly relevant to experimental synthesis.

Learning to grow: control of material self-assembly using evolutionary reinforcement learning
Stephen Whitelam and Isaac Tamblyn, arXiv:1912.08333v3 [cond-mat.stat-mech] 28 May 2020

We show that neural networks trained by evolutionary reinforcement learning can enact efficient molecular self-assembly protocols. Presented with molecular simulation trajectories, networks learn to change temperature and chemical potential in order to promote the assembly of desired structures or choose between competing polymorphs. In the first case, networks reproduce in a qualitative sense the results of previously-known protocols, but faster and with higher fidelity; in the second case they identify strategies previously unknown, from which we can extract physical insight. Networks that take as input the elapsed time of the simulation or microscopic information from the system are both effective, the latter more so. The evolutionary scheme we have used is simple to implement and can be applied to a broad range of examples of experimental self-assembly, whether or not one can monitor the experiment as it proceeds. Our results have been achieved with no human input beyond the specification of which order parameter to promote, pointing the way to the design of synthesis protocols by artificial intelligence.

Generative Adversarial Networks for Crystal Structure Prediction
Sungwon Kim, Juhwan Noh, Geun Ho Gu, Alan Aspuru-Guzik, and Yousung Jung ACS Cent. Sci. 2020, 6, 1412−1420

ABSTRACT: The constant demand for novel functional materials calls for efficient strategies to accelerate the materials discovery, and crystal structure prediction is one of the most fundamental tasks along that direction. In addressing this challenge, generative models can offer new opportunities since they allow for the continuous navigation of chemical space via latent spaces. In this work, we employ a crystal representation that is inversion-free based on unit cell and fractional atomic coordinates and build a generative adversarial network for crystal structures. The proposed model is applied to generate the Mg−Mn−O ternary materials with the theoretical evaluation of their photoanode properties for high-throughput virtual screening (HTVS). The proposed generative HTVS framework predicts 23 new crystal structures with reasonable calculated stability and band gap. These findings suggest that the generative model can be an effective way to explore hidden portions of the chemical space, an area that is usually unreachable when conventional substitution-based discovery is employed.

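Skimming a few papers showed me that Deep Reinforcement Learning (DRL) is playing a very important role in the natural sciences too, and that its use is spreading rapidly.

Back to A. Geron's text to study DRL.

May 24 (Mon)

Bristol-Myers Squibb: 781 teams, 10 days to go

I browsed the threads by the GM with the Keroppi avatar.

I can only admire the ability to gather, condense, and restructure the relevant information.

Reinforcement Learning: studying with A. Geron's text

Introduction to OpenAI Gym: I had been avoiding OpenAI Gym without ever trying it

Training an agent requires a working environment. OpenAI Gym provides simulated environments for this purpose (Atari games, board games, 2D and 3D physical simulations, and so on).

The text first explains how to install OpenAI Gym.

The explanation then continues while running CartPole.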

>>> import gym
>>> env = gym.make("CartPole-v1")
>>> obs = env.reset()
>>> obs
array([-0.01258566, -0.00156614, 0.04207708, -0.00180545])

CartPole appears to be a 2D physical simulation.

obs is a 1D NumPy array of four numbers, in this order: obs[0]: cart's horizontal position (0.0 = center), obs[1]: its velocity (positive means right), obs[2]: the angle of the pole (0.0 = vertical), and obs[3]: its angular velocity (positive means clockwise).

CartPole is not interesting at all, so I cannot get into it, I think to myself as I reluctantly read on.

>>> env.action_space
Discrete(2)

There are two actions, the integers 0 and 1: 0 accelerates left and 1 accelerates right.

>>> action = 1  # accelerate right
>>> obs, reward, done, info = env.step(action)
>>> obs
array([-0.01261699, 0.19292789, 0.04204097, -0.28092127])
>>> reward
1.0
>>> done
False
>>> info
{}

The step() method executes the given action and returns four values:

obs:

This is the new observation. The cart is now moving toward the right (obs[1] > 0). The pole is still tilted toward the right (obs[2] > 0), but its angular velocity is now negative (obs[3] < 0), so it will likely be tilted toward the left after the next step.

reward:

In this environment, you get a reward of 1.0 at every step, no matter what you do, so that the goal is to keep the episode running as long as possible.

done:

This value will be True when the episode is over. This will happen when the pole tilts too much, or goes off the screen, or after 200 steps (in this last case, you have won). After that, the environment must be reset before it can be used again.

info:

This environment-specific dictionary can provide some extra information that you may find useful for debugging or for training. For example, in some games it may indicate how many lives the agent has.

First, a simple policy: accelerate the cart to the right when the pole is leaning right, and to the left when the pole is leaning left.

# Uses the classic Gym API (gym < 0.26): reset() returns obs, step() returns 4 values.
def basic_policy(obs):
    angle = obs[2]  # angle of the pole
    return 0 if angle < 0 else 1  # 0 = push left, 1 = push right

totals = []
for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

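Running this gave the following result.

>>> import numpy as np
>>> np.mean(totals), np.std(totals), np.min(totals), np.max(totals)
(41.718, 8.858356280936096, 24.0, 68.0)

As long as the pole has not fallen (its tilt is below a set angle and the cart is still on screen), the reward is 1.0 per step. The result says the pole was kept up for 41.7 steps on average, with a standard deviation of 8.9, a minimum of 24 steps, and a maximum of 68 steps.

In other words, the pole was never kept up for more than 68 steps. The text goes on to show that a neural network can improve the policy.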

Neural Network Policies

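At last the neural network appears. It takes an observation as input and outputs the action to execute. More precisely, what it outputs is a probability for each action, and the action is then chosen at random according to these probabilities.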
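To make the "choose the action at random according to the probability" step concrete, here is a minimal sketch. It assumes the network has already produced a single number p_left, the probability of action 0 (left); the function name sample_action and the value 0.7 are made up for illustration and do not come from the text.

```python
import random

def sample_action(p_left, rng=random):
    """Return 0 (left) with probability p_left, otherwise 1 (right)."""
    return 0 if rng.random() < p_left else 1

# Hypothetical network output: 70% chance of pushing left.
rng = random.Random(42)
actions = [sample_action(0.7, rng) for _ in range(10_000)]
left_fraction = actions.count(0) / len(actions)
print(f"fraction of left actions: {left_fraction:.3f}")
```

Sampling instead of always taking the most probable action lets the agent keep trying the less likely action now and then, which is how it can discover that an action it currently rates low is actually useful.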

May 25 (Tue)

Bristol-Myers Squibb: 782 teams, 9 days to go

Today I again browsed the threads by the GM with the Keroppi avatar. They are full of terms I do not know, so I looked some of them up.

Fairseq is a sequence modeling toolkit written in PyTorch that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.

Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code.

pre-norm activation transformer:

Transformers without Tears: Improving the Normalization of Self-Attention

Toan Q. Nguyen and Julian Salazar, arXiv:1910.05895v2 [cs.CL] 30 Dec 2019

Abstract
We evaluate three simple, normalization-centric changes to improve Transformer training. First, we show that pre-norm residual connections (PRENORM) and smaller initializations enable warmup-free, validation-based training with large learning rates. Second, we propose ℓ2 normalization with a single scale parameter (SCALENORM) for faster training and better performance. Finally, we reaffirm the effectiveness of normalizing word embeddings to a fixed length (FIXNORM). On five low-resource translation pairs from TED Talks-based corpora, these changes always converge, giving an average +1.1 BLEU over state-of-the-art bilingual baselines and a new 32.8 BLEU on IWSLT '15 English-Vietnamese. We observe sharper performance curves, more consistent gradient norms, and a linear relationship between activation scaling and decoder depth. Surprisingly, in the high-resource setting (WMT '14 English-German), SCALENORM and FIXNORM remain competitive but PRENORM degrades performance.

Transformerトレーニングを改善するために、3つの単純な正規化中心の変更を評価します。まず、ノルム前の残余接続（PRENORM）と小さな初期化により、ウォームアップのない検証ベースのトレーニングが大きな学習率で可能になることを示します。次に、トレーニングを高速化し、パフォーマンスを向上させるために、単一のスケールパラメータ（SCALENORM）を使用した `2正規化を提案します。最後に、単語の埋め込みを固定長（FIXNORM）に正規化することの有効性を再確認します。 TED Talksベースのコーパスからの5つの低リソース翻訳ペアでは、これらの変更は常に収束し、最先端のバイリンガルベースラインを平均+1.1 BLEU、IWSLT '15EnglishVietnameseで新しい32.8BLEUを提供します。よりシャープなパフォーマンス曲線、より一貫性のある勾配基準、およびアクティベーションスケーリングとデコーダー深度の間の線形関係を観察します。驚いたことに、高リソース設定（WMT '14英語-ドイツ語）では、SCALENORMとFIXNORMは引き続き競争力がありますが、PRENORMはパフォーマンスを低下させます。by Google翻訳

The next two papers are background knowledge.

Show and Tell: A Neural Image Caption Generator
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan
arXiv:1411.4555v2 [cs.CV] 20 Apr 2015

Abstract
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.

画像の内容を自動的に記述することは、コンピュータービジョンと自然言語処理を結び付ける人工知能の基本的な問題です。この論文では、コンピュータビジョンと機械翻訳の最近の進歩を組み合わせ、画像を説明する自然な文を生成するために使用できる、ディープリカレントアーキテクチャに基づく生成モデルを提示します。モデルは、トレーニング画像が与えられた場合にターゲットの説明文の可能性を最大化するようにトレーニングされます。いくつかのデータセットでの実験は、モデルの正確さと、画像の説明からのみ学習する言語の流暢さを示しています。私たちのモデルはしばしば非常に正確であり、定性的および定量的に検証します。たとえば、Pascalデータセットの現在の最先端のBLEU-1スコア（高いほど良い）は25ですが、私たちのアプローチでは59が得られ、69前後の人間のパフォーマンスと比較されます。BLEU-1も示しています。 Flickr30kのスコアが56から66に、SBUのスコアが19から28に向上しました。最後に、新しくリリースされたCOCOデータセットで、現在の最先端のBLEU-4である27.7を達成しました。by Google翻訳


Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu et al., arXiv:1502.03044v3 [cs.LG] 19 Apr 2016

Abstract
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

機械翻訳とオブジェクト検出の最近の研究に触発されて、画像の内容を説明することを自動的に学習する注意ベースのモデルを紹介します。標準的なバックプロパゲーション手法を使用して、変分下限を最大化することにより確率的にこのモデルを決定論的にトレーニングする方法について説明します。また、視覚化を通じて、出力シーケンスで対応する単語を生成しながら、モデルが顕著なオブジェクトを注視することを自動的に学習する方法を示します。 Flickr8k、Flickr30k、MS COCOの3つのベンチマークデータセットで、最新のパフォーマンスを使用してattentionの使用を検証します。　　　　　　by Google翻訳


Reinforcement Learning
Neural Network Policies

Example of building a neural network policy with tf.keras

import tensorflow as tf
from tensorflow import keras

n_inputs = 4  # == env.observation_space.shape[0]

model = keras.models.Sequential([
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    keras.layers.Dense(1, activation="sigmoid"),
])

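The input layer has 4 units; the inputs are the observations: position, velocity, angle, and angular velocity.

The hidden layer has 5 units, since this is a simple task. The activation function is "elu".

The output is just the probability of action 0 (left), so there is 1 output unit. Because the output is a probability, the activation function is "sigmoid".

That completes the neural network policy.

The result is a neural network with 4 input units, 5 hidden units, and 1 output unit.

It is important to recognize how this differs from the earlier hardcoded policy. The hardcoded policy chose the action from the sign of angle (the tilt of the pole). The neural network policy instead chooses the action from a probability estimated by a nonlinear function of four values: the cart position (0.0 is the center of the screen), the cart velocity (positive to the right), the pole angle (0.0 is vertical, positive when tilted clockwise, i.e. to the right), and the angular velocity (roughly how fast the pole is falling, positive clockwise).

So how is it trained?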

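Evaluating Actions: The Credit Assignment Problem

Ordinary supervised learning is not possible. With the hardcoded program above, given episodes that kept the pole up for more than the average number of steps and episodes that managed fewer, what can be learned, and how? The program only accelerates the cart toward the side the pole is leaning, so how would better behavior be found?

The text introduces a discount factor and walks through a thought experiment. It seems to control how many future rewards are accumulated and how quickly they decay. With a discount factor of 0.95, a reward counts for about half after 13 steps; with 0.99, after 69 steps.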
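The 13-step and 69-step figures can be checked directly: a reward n steps ahead is weighted by gamma**n, so the weight drops to one half when n = ln(0.5) / ln(gamma). The following is a small sketch of that arithmetic; the helper name half_life is my own label, not a term from the text.

```python
import math

def half_life(gamma):
    """Number of steps n at which gamma**n equals 0.5."""
    return math.log(0.5) / math.log(gamma)

for gamma in (0.95, 0.99):
    n = half_life(gamma)
    print(f"gamma={gamma}: weight halves after about {n:.1f} steps")

# Weights at the step counts quoted in the text.
weight_13 = 0.95 ** 13
weight_69 = 0.99 ** 69
print(f"0.95**13 = {weight_13:.3f}, 0.99**69 = {weight_69:.3f}")
```

So the discount factor sets a time horizon: the closer it is to 1, the further into the future a reward still matters to the action being evaluated.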

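Next, if an action advantage is defined by comparing the proportions of good and bad actions, each action can be classified as having a positive or a negative advantage. That apparently amounts to evaluating the actions.

My understanding is incomplete, so these sentences do not connect and make little sense. To be revisited.

I cannot make sense of Figure 18-6. Computing an action's return: the sum of discounted future rewards, so I am stuck here.

The section closes as follows.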

Perfect - now that we have a way to evaluate each action, we are ready to train our first agent using policy gradients. Let's see how.

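May 26 (Wed)

Bristol-Myers Squibb: 791 teams, 8 days to go

Today I will prioritize studying Deep Reinforcement Learning.

Deep Reinforcement Learning

Evaluating Actions: The Credit Assignment Problem:

Let me reread this section from the first line.

Translating it into Japanese does not help me understand, so I will copy excerpts from the text. It looks like it will end up being the full text rather than excerpts, because skipping even one line might make me lose the thread.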

If we knew what the best action was at each step, we could train the neural network as usual, by minimizing the cross entropy between the estimated probability distribution and the target probability distribution.

It would just be regular supervised learning.

However, in Reinforcement Learning the only guidance the agent gets is through rewards, and rewards are typically sparse and delayed.

For example, if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good, and which of them were bad?

All it knows is that the pole fell after the last action, but surely this last action is not entirely responsible.

This is called the credit assignment problem: when the agent gets a reward, it is hard for it to know which actions should get credited (or blamed) for it.

Think of a dog that gets rewarded hours after it behaved well: will it understand what it is being rewarded for?

Move the cart left and right so that the pole on it does not fall. A two-dimensional street performance.

To tackle this problem, a common strategy is to evaluate an action based on the sum of all the rewards that come after it, usually applying a discount factor γ (gamma) at each step. I do not understand the meaning or role of the discount factor.

This sum of discounted rewards is called the action's return. Here, action seems to refer not to a single step but to a whole sequence of steps.

Consider the example in Figure 18-6.

If an agent decides to go right three times in a row and gets +10 reward after the first step, 0 after the second step, and finally -50 after the third step, then assuming we use a discount factor γ = 0.8, the first action will have a return of 10 + γ × 0 + γ^2 × (-50) = -22.
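The arithmetic in this example can be reproduced with a few lines of Python. This is a sketch of one common way to compute it (walking backwards through the rewards), not necessarily how the book's own code is written; the function name discounted_returns is my own.

```python
def discounted_returns(rewards, gamma):
    """Return, for each step, the sum of that reward plus all discounted later rewards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

returns = discounted_returns([10, 0, -50], gamma=0.8)
print(returns)  # first action: about -22, second: about -40, third: -50
```

Each of the three actions gets its own return: the third action is charged only with its own -50, the second with 0 + 0.8 × (-50), and the first with 10 + 0.8 × 0 + 0.8² × (-50).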

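This is where I get confused. The explanation of reward on p.616 says every step earns a reward of 1.0. Here, however, the agent goes right three times in a row and gets a reward of +10 from the first step, 0 from the second, and -50 from the third. I need to think this over.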

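May 27 (Thu)

Bristol-Myers Squibb: 799 teams, 7 days to go

One week left. I will have to study hard once it is over.

Image-to-text conversion: molecular structure image to text, figure to text, math expression to text, function to text, equation to text, theoretical formula in physics to text, chemical reaction formula to text, protein structural formula to text, ...

Let's design an artificial human equipped with sensors corresponding to the five human senses, one that learns and acts autonomously. The goal is Number 5.

WIKIPEDIA: Short Circuit (1986 film)

Reinforcement Learning: A. Geron's text, Chapter 18: Reinforcement Learning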

If the discount factor is close to 0, then future rewards won't count for much compared to immediate rewards.　

Conversely, if the discount factor is close to 1, then rewards far into the future will count almost as much as immediate rewards.　

Typical discount factors vary from 0.9 to 0.99.

With a discount factor of 0.95, rewards 13 steps into the future count roughly for half as much as immediate rewards (since 0.95^13≒0.5), while with a discount factor of 0.99, rewards 69 steps into the future count for half as much as immediate rewards.　

In the CartPole environment, actions have fairly short-term effects, so choosing a discount factor of 0.95 seems reasonable.　

Figure 18-6. Computing an action's return: the sum of discounted future rewards

Of course, a good action may be followed by several bad actions that cause the pole to fall quickly, resulting in the good action getting a low return (similarly, a good actor may sometimes star in a terrible movie).

However, if we play the game enough times, on average good actions will get a higher return than bad ones.

We want to estimate how much better or worse an action is, compared to the other possible actions, on average.

This is called the action advantage.

For this, we must run many episodes and normalize all the action returns (by subtracting the mean and dividing by the standard deviation).

After that, we can reasonably assume that actions with a negative advantage were bad while actions with a positive advantage were good.
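The normalization step described above can be sketched as follows. I assume the returns of every action in every episode have already been computed; the two example episodes are the returns for rewards [10, 0, -50] and [10, 20] with a discount factor of 0.8. The function name normalize_returns is hypothetical, and I use the population standard deviation, which is what np.std computes by default.

```python
import statistics

def normalize_returns(all_returns):
    """Subtract the overall mean and divide by the overall standard deviation."""
    flat = [r for episode in all_returns for r in episode]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)  # assumes the returns are not all identical
    return [[(r - mean) / std for r in episode] for episode in all_returns]

# Returns of two short episodes (discount factor 0.8).
all_returns = [[-22.0, -40.0, -50.0], [26.0, 20.0]]
advantages = normalize_returns(all_returns)
for episode in advantages:
    print([round(a, 3) for a in episode])
```

After normalization every action in the first episode has a negative advantage and every action in the second a positive one, which is the sense in which actions get sorted into bad and good.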

Perfect - now that we have a way to evaluate each action, we are ready to train our first agent using policy gradients.

Let's see how.

Policy Gradients

As discussed earlier, PG algorithms optimize the parameters of a policy by following the gradients toward higher rewards.

One popular class of PG algorithms, called REINFORCE algorithms, was introduced back in 1992 (https://homl.info/132) by Ronald Williams.

Here is one common variant:

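May 28 (Fri)

Bristol-Myers Squibb: 806 teams, 6 days to go

No progress.

May 29 (Sat)

Bristol-Myers Squibb: 816 teams, 6 days to go

No progress.

May 31 (Mon)

Bristol-Myers Squibb: 831 teams, 4 days to go

I had been working mainly on Bristol-Myers Squibb, but it was not going well, and partway through I switched to studying Reinforcement Learning, until I no longer knew what I was doing.

I will run the public training code and tune it, to understand the code at least a little.

Before running the public code I wanted to understand Attention and the Transformer properly and was writing an article about them, but partway through the browser reset and everything I had written was lost.

Let me quote a little from Chapter 8, Attention, of Deep Learning from Scratch 2: Natural Language Processing (ゼロから作るDeep Learning 2 自然言語処理編) by Koki Saitoh (斎藤康毅).

8.5.2 Transformer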

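We have used RNNs (LSTMs) everywhere so far. From language models to text generation, seq2seq, and seq2seq with Attention, an RNN always appeared as a component. With RNNs, variable-length time-series data is handled well and (in many cases) good results are obtained. But RNNs also have drawbacks. One of them concerns parallelization.

An RNN computes sequentially, using the result computed at the previous time step. As a result, an RNN's computation cannot (in general) be parallelized along the time axis. Given that deep learning is normally run in parallel computing environments using GPUs, this is a major bottleneck. Hence the motivation to avoid RNNs.

Against this background, research on removing RNNs (or on RNNs that can be computed in parallel) is now very active. The best known result is the Transformer, the method proposed in the paper titled "Attention is all you need". As the title indicates, it processes data with Attention rather than RNNs. Here we take a brief look at the Transformer.

The Transformer is built from Attention, and the key point is that it uses a technique called Self-Attention. Translated literally, Self-Attention means "attention to oneself". That is, it is Attention over a single time series: it looks at how each element in one time series relates to the other elements. In terms of our Time Attention layer, Self-Attention can be drawn as in Figure 8-37.

Until now, we have used Attention to find correspondences between two time series, as in translation. In that case, as the left side of Figure 8-37 shows, two different time series (hs_enc, hs_dec) are fed to the two inputs of the Time Attention layer. With Self-Attention, as the right side of Figure 8-37 shows, a single time series (hs) is fed to both inputs. This yields the correspondences among the elements within one time series.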
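The difference between the two cases is only which sequences are passed in, and a toy example makes this visible. Below is a dependency-free sketch of plain dot-product attention (scores, softmax, weighted sum). It is a simplification for illustration: it has no learned projections, no scaling, and no multiple heads, so it is not the book's Time Attention layer or the full Transformer layer, and the names attention and hs are my own choices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys_values):
    """Each query row attends over all rows of keys_values."""
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every element of the other sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the rows, dimension by dimension.
        context = [sum(w * row[d] for w, row in zip(weights, keys_values))
                   for d in range(len(keys_values[0]))]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

# Self-attention: the same time series hs is fed to both inputs.
hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(hs, hs)
for w in weights:
    print([round(x, 3) for x in w])
```

For translation-style attention one would call attention(hs_dec, hs_enc) with two different sequences; passing hs twice is all that makes it self-attention. Each row of weights shows how strongly one element of hs relates to every element of the same sequence.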

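Now that Self-Attention has been explained, let us look at the layer structure of the Transformer, shown in Figure 8-38.

The Transformer uses Attention in place of RNNs. Figure 8-38 shows that Self-Attention is used in both the Encoder and the Decoder. The Feed Forward layer in Figure 8-38 is a feed-forward network (one that processes each time step independently). Specifically, it is a fully connected neural network with one hidden layer and ReLU activation. The "Nx" in the figure means that the elements enclosed in the gray background are stacked N times.

Using the Transformer reduces the amount of computation and makes better use of GPU parallelism. As a result, the Transformer needed far less training time than GNMT. As Figure 8-39 shows, it also improved translation accuracy.

Figure 8-39 compares three methods. The "seq2seq with convolutional layers" (ConvS2S in the figure) is more accurate than GNMT, and the Transformer surpasses even that. Attention is thus a promising technique in terms of accuracy as well as computation.

We have so far used Attention in combination with RNNs. But as this work suggests, Attention can also serve as a module that replaces RNNs. This may further widen the range of uses for Attention.

That concludes the quotation on the Transformer from Chapter 8, Attention, of Deep Learning from Scratch 2: Natural Language Processing by Koki Saitoh.

I plan to run the public training code for Bristol-Myers Squibb tomorrow.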

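June 1 (Tue)

Bristol-Myers Squibb: 841 teams, 3 days to go

A score of 1.42 for 50th place. Remarkable people have gathered here.

Running the public code (train):

I had not run code on Kaggle for about two weeks, so I fumbled a little.

Watching the code I am running now, "Tensorflow TPU Training Baseline LB 16.92", and reading through it, what strikes me is that I wasted these two weeks. Two weeks ago I had planned to study this code and the code its author drew on, yet I left the competition's main street for a side road and spent two weeks playing in reinforcement learning square. I regret it.

To use a TPU, you have to start by converting the dataset to TFRecord.

"Advanced Image Cleaning and TFRecord Generation"

This notebook cleans the images (removing dot noise, repairing gaps in lines), crops away the unneeded blank border, and converts the result to TFRecord.

Choosing the image size is not easy either. The author writes:

Choosing an appropriate image size is difficult, as complex molecules will need a high resolution to preserve details, but training on 2.4 million in high resolution is infeasible. The chosen resolution of 256*448 should preserve enough detail and allow for training on a TPU within the 3 hours limit.

The image resolution has to be chosen so that 2.4 million images can be processed within 3 hours. The training code trains on 512k samples per epoch, not 2.4M. Done naively, this would use only part of the data. According to the Q&A on the public code, the author later changed it so that all the data is used in training. This reminded me of the HuBMAP competition, where I could not use all the training data and my score did not improve. There, train_data had to be loaded before training started, and about a third of it did not fit in memory. Still, it is common knowledge that when train_data is unevenly distributed, accuracy does not improve unless all of it is used, so I should have forced out a workaround. For example, loading two thirds of train_data took about 3 minutes, so with 10 epochs of training I could have loaded a different combination of train_data each epoch; this would have added only about 30 minutes, used all the training data, and possibly raised the score.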
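The workaround in the last sentences can be sketched as a simple rotation. I assume here that the training data is split into three equal parts and that memory holds two of them at a time; the function name parts_for_epoch and the three-way split are assumptions made for illustration, not something taken from the HuBMAP or Bristol-Myers Squibb code.

```python
def parts_for_epoch(epoch, n_parts=3):
    """Indices of the data parts to load this epoch: every part except one, rotating."""
    skipped = epoch % n_parts
    return [p for p in range(n_parts) if p != skipped]

seen = set()
for epoch in range(10):
    parts = parts_for_epoch(epoch)
    seen.update(parts)
    print(f"epoch {epoch}: load parts {parts}")

# Extra loading time if each reload takes about 3 minutes.
extra_minutes = 10 * 3
print(f"all parts seen: {sorted(seen)}, extra loading time: {extra_minutes} min")
```

With this rotation every part has been seen after two epochs, and over 10 epochs each part is used in six or seven of them, at the cost of roughly 30 minutes of extra loading.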

To clean the images (removing dot noise, repairing gaps in lines), the notebook uses code such as the following.

# NOTE: the [[ ... ]] nesting on the first two kernels is assumed; it appears to
# have been lost when the code was pasted, and it matches the kernel_h_multi call.

# single pixel width horizontal line with 1 pixel missing
kernel_h_single_mono = pad_kernel([[ a, a, a, -1, a, a, a ]], max_pad=1)

# single pixel width horizontal line with 3 pixels missing
kernel_h_single_triple = pad_kernel([[ a, a, a, -1, -1, -1, a, a, a ]], max_pad=1)

kernel_h_multi = pad_kernel([
    [ a, a, a, a, a, a, a ],
    [ a, a, a,-1, a, a, a ],
    [ a, a, a, a, a, a, a ],
], max_pad=1)
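I could not read how pad_kernel and these kernels are applied, so the following is only a stand-in that shows the idea behind the first kernel: find a single missing pixel with three line pixels on each side and fill it in. It uses direct pattern matching on one row of a binary image (1 = line pixel, 0 = background) rather than the notebook's kernels; fill_single_pixel_gaps and the arm parameter are hypothetical names.

```python
def fill_single_pixel_gaps(row, arm=3):
    """Fill a lone 0 that has `arm` consecutive 1s directly on both sides."""
    filled = list(row)
    for i in range(arm, len(row) - arm):
        left = row[i - arm:i]
        right = row[i + 1:i + 1 + arm]
        if row[i] == 0 and all(left) and all(right):
            filled[i] = 1
    return filled

broken = [0, 1, 1, 1, 0, 1, 1, 1, 0, 0]
repaired = fill_single_pixel_gaps(broken)
print(repaired)  # the gap at index 4 is filled, the rest is unchanged

# A two-pixel gap does not match the pattern and is left alone.
wide_gap = [1, 1, 1, 0, 0, 1, 1, 1]
untouched = fill_single_pixel_gaps(wide_gap)
```

Requiring three line pixels on both sides keeps the rule from joining unrelated strokes that merely end near each other; wider gaps would need their own pattern, as the separate 3-pixel kernel in the notebook suggests.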

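I had wanted to do this kind of thing with an ANN, but I think it is also important to be able to use techniques like this.

An inchi_int seems to be defined for the InChI code, but I cannot follow the code.

Half for fun, I trained with 10, 20, and 30 epochs, put the trained parameters into a dataset, and ran the inference program. The LB scores were 11.36, 8.23, and 7.85, respectively.

I doubt this will go much further, but for the remaining two days I will look through the discussions and other public code and try ways of improving the score.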

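June 2 (Wed)

Bristol-Myers Squibb: 855 teams, 2 days to go

Let's change the batch size from 512 to 1024.

Let's change the learning rate lr range from 2e-3 to 1e-4 into 1e-3 to 1e-5.

The LB score worsened to 9.19.

Tomorrow is the last day, but I am stopping here.

This competition was a failure.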

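June 4 (Fri)

The final results of the Bristol-Myers Squibb competition just came out. Since it is not a code competition, the results were final the same day! The rankings hardly changed either.

Reflections

1. The only point of entering a competition is to improve the score. I drifted toward reinforcement learning partway through because I felt a generative model was needed to improve the score, but turning to reinforcement learning was a mistake. What is more, I lacked concentration while studying it.

2. I could not solve the problem in my own head (I never grasped the correspondence between the molecular structure and the InChI code), so I could not find a direction for building a model or steer toward a better score. I should have studied the competition's discussions and public code carefully and worked out a plan.

3. Keeping gold or silver in mind is important for motivation, but this time I gave up early even on reaching the medal zone, wandered off course, and barely even tuned the public code. I should at least have tried the code published by the person with the Keroppi avatar.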