Understanding and Mitigating Distribution Shifts for Machine Learning Force Fields

Abstract

Machine Learning Force Fields (MLFFs) are a promising alternative to expensive ab initio quantum mechanical molecular simulations. Given the diversity of the chemical spaces of interest and the cost of generating new data, it is important to understand how MLFFs generalize beyond their training distributions. To characterize and better understand distribution shifts in MLFFs, we conduct diagnostic experiments on chemical datasets, revealing common shifts that pose significant challenges even for large foundation models trained on extensive data. Based on these observations, we hypothesize that current supervised training methods inadequately regularize MLFFs, resulting in overfitting and poor learned representations of out-of-distribution systems. We then propose two new methods as initial steps toward mitigating distribution shifts for MLFFs. Our methods focus on test-time refinement strategies that incur minimal computational cost and do not use expensive ab initio reference labels. The first strategy, based on spectral graph theory, modifies the edges of test graphs to align with graph structures seen during training. The second strategy improves representations of out-of-distribution systems at test time by taking gradient steps on an auxiliary objective, such as a cheap physical prior. Our test-time refinement strategies significantly reduce errors on out-of-distribution systems, suggesting that MLFFs are capable of modeling diverse chemical spaces but are not being effectively trained to do so. Our experiments establish clear benchmarks for evaluating the generalization capabilities of the next generation of MLFFs. Our code is available at https://tkreiman.github.io/projects/mlff_distribution_shifts/.
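To make the second strategy more concrete, the following is a minimal toy sketch of test-time refinement with an auxiliary objective. It is not the authors' implementation: the scalar model, the names `ToyModel`, `cheap_prior`, `aux_loss`, and `refine_at_test_time`, and the learning rate and step count are all assumptions chosen for illustration. The idea shown is only that a shared representation is updated by gradient steps on a cheap auxiliary target, without any reference labels for the main task.

```python
"""Toy sketch of test-time refinement with an auxiliary objective.

Assumptions (not from the paper): a single scalar weight stands in for the
shared representation, and `cheap_prior` is a made-up inexpensive target.
"""


def cheap_prior(x: float) -> float:
    # Hypothetical inexpensive reference signal (stand-in for a physical prior).
    return 2.0 * x


class ToyModel:
    def __init__(self, w: float, main_head: float, aux_head: float) -> None:
        self.w = w                  # shared representation weight (refined at test time)
        self.main_head = main_head  # main-task head, kept frozen at test time
        self.aux_head = aux_head    # auxiliary head, kept frozen at test time

    def represent(self, x: float) -> float:
        return self.w * x

    def predict_main(self, x: float) -> float:
        return self.main_head * self.represent(x)

    def predict_aux(self, x: float) -> float:
        return self.aux_head * self.represent(x)


def aux_loss(model: ToyModel, xs: list[float]) -> float:
    # Mean squared error between the auxiliary prediction and the cheap prior.
    return sum((model.predict_aux(x) - cheap_prior(x)) ** 2 for x in xs) / len(xs)


def refine_at_test_time(model: ToyModel, xs: list[float],
                        lr: float = 0.01, steps: int = 50) -> ToyModel:
    # Gradient descent on the auxiliary loss with respect to the shared weight only.
    for _ in range(steps):
        grad = sum(
            2.0 * (model.predict_aux(x) - cheap_prior(x)) * model.aux_head * x
            for x in xs
        ) / len(xs)
        model.w -= lr * grad
    return model


if __name__ == "__main__":
    test_inputs = [1.0, 2.0, 3.0]  # stand-in for out-of-distribution test systems
    model = ToyModel(w=0.5, main_head=1.5, aux_head=1.0)
    print("auxiliary loss before:", aux_loss(model, test_inputs))
    refine_at_test_time(model, test_inputs)
    print("auxiliary loss after: ", aux_loss(model, test_inputs))
```

In this sketch only the shared weight changes, so the main-task prediction moves because the representation moved, not because any main-task label was used. Whether that helps the main task depends on how well the auxiliary objective correlates with it, which the toy example does not demonstrate.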
