Understanding and Mitigating Distribution Shifts For Machine Learning Force Fields

Abstract

Machine Learning Force Fields (MLFFs) are a promising alternative to expensive ab initio quantum mechanical molecular simulations. Given the diversity of the chemical spaces of interest and the cost of generating new data, it is important to understand how MLFFs generalize beyond their training distributions. To characterize and better understand distribution shifts in MLFFs, we conduct diagnostic experiments on chemical datasets, revealing common shifts that pose significant challenges, even for large foundation models trained on extensive data. Based on these observations, we hypothesize that current supervised training methods inadequately regularize MLFFs, resulting in overfitting and poor learned representations of out-of-distribution systems. We then propose two new methods as initial steps for mitigating distribution shifts for MLFFs. Our methods focus on test-time refinement strategies that incur minimal computational cost and do not use expensive ab initio reference labels. The first strategy, based on spectral graph theory, modifies the edges of test graphs to align with graph structures seen during training. The second strategy improves representations of out-of-distribution systems at test time by taking gradient steps on an auxiliary objective, such as a cheap physical prior. Our test-time refinement strategies significantly reduce errors on out-of-distribution systems, suggesting that MLFFs are capable of modeling diverse chemical spaces but are not yet being trained effectively to do so. Our experiments establish clear benchmarks for evaluating the generalization capabilities of the next generation of MLFFs. Our code is available at https://tkreiman.github.io/projects/mlff_distribution_shifts/.
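
As a rough illustration of the first strategy, the sketch below picks, for a test structure, the radius cutoff whose graph Laplacian spectrum most resembles a reference spectrum taken from training-like graphs. The abstract does not specify the procedure, so everything here is an assumption made for illustration: the function names (laplacian_spectrum, spectrum_histogram, refine_cutoff) are hypothetical, and so are the use of a radius graph, the normalized Laplacian, and an L1 distance between eigenvalue histograms. It should not be read as the authors' implementation.

```python
import numpy as np


def laplacian_spectrum(positions, cutoff):
    """Eigenvalues of the symmetric normalized Laplacian of a radius graph."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Connect distinct atoms that are closer than the cutoff.
    adj = ((dists < cutoff) & (dists > 0)).astype(float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    # L = I - D^{-1/2} A D^{-1/2}; isolated nodes get a zero row and column.
    lap = np.diag((deg > 0).astype(float)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    # Clip tiny numerical overshoots so values stay in the theoretical range [0, 2].
    return np.clip(np.linalg.eigvalsh(lap), 0.0, 2.0)


def spectrum_histogram(eigs, bins=10):
    """Size-independent summary of a spectrum: normalized histogram over [0, 2]."""
    hist, _ = np.histogram(eigs, bins=bins, range=(0.0, 2.0))
    return hist / hist.sum()


def refine_cutoff(positions, reference_hist, candidate_cutoffs):
    """Return the candidate cutoff whose spectrum best matches the reference."""
    def mismatch(cutoff):
        hist = spectrum_histogram(laplacian_spectrum(positions, cutoff))
        return np.abs(hist - reference_hist).sum()
    return min(candidate_cutoffs, key=mismatch)


# Toy example (assumed data): a linear chain of 6 atoms with unit spacing.
chain = np.stack([np.arange(6.0), np.zeros(6), np.zeros(6)], axis=1)
reference_hist = spectrum_histogram(laplacian_spectrum(chain, cutoff=1.5))
best_cutoff = refine_cutoff(chain, reference_hist, [1.5, 2.5, 3.5])
```

In this toy setting the reference is built from the same chain, so the search simply recovers the cutoff that produced it. In practice the reference would have to be aggregated over training graphs, and no ab initio labels are needed at any point, which is consistent with the low-cost, label-free setting the abstract describes.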
