Towards Data-Efficient Pretraining for Atomic Property Prediction

Abstract

This paper challenges the recent paradigm in atomic property prediction that ties progress to ever-larger datasets and computational resources. We show that pretraining on a carefully selected, task-relevant dataset can match or even surpass large-scale pretraining while using as little as 1/24th of the computational cost. We introduce the Chemical Similarity Index (CSI), a novel metric for molecular graphs, inspired by computer vision's Fréchet Inception Distance, that quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most relevant dataset, the one with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently outperform those pretrained on massive, mixed datasets such as JMP, even when those larger datasets include the relevant dataset. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data aligns poorly with the task at hand. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.
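
The abstract does not spell out how CSI is computed, so the following is only a rough sketch of the Fréchet-style idea it refers to, not the authors' implementation. It fits a Gaussian to each set of feature vectors, computes the standard Fréchet distance between the two Gaussians (the quantity used by the Fréchet Inception Distance), and picks the upstream dataset closest to the downstream task. The function names, the dataset names "near" and "far", and the random features standing in for molecular-graph embeddings are all assumptions made for illustration.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input is an array of shape (n_samples, n_features), n_features >= 2.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices (up to numerical noise, hence the clip).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


def select_closest(upstream, downstream_feats):
    """Return the name of the upstream dataset with the smallest distance."""
    scores = {name: frechet_distance(f, downstream_feats) for name, f in upstream.items()}
    return min(scores, key=scores.get), scores


# Hypothetical stand-ins for learned embeddings of molecular graphs.
rng = np.random.default_rng(0)
downstream = rng.normal(0.0, 1.0, size=(500, 4))
upstream = {
    "near": rng.normal(0.1, 1.0, size=(500, 4)),  # similar distribution
    "far": rng.normal(3.0, 2.0, size=(500, 4)),   # shifted and wider
}
best, scores = select_closest(upstream, downstream)
print(best, scores)
```

In this toy setup the dataset whose features are distributed most like the downstream features gets the smallest distance, mirroring the selection rule described above. How the features are extracted from molecular graphs is the part specific to CSI and is not covered here.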
