Understanding and Mitigating Distribution Shifts for Machine Learning Force Fields

Abstract

Machine Learning Force Fields (MLFFs) are a promising alternative to expensive ab initio quantum mechanical molecular simulations. Given the diversity of chemical spaces of interest and the cost of generating new data, it is important to understand how MLFFs generalize beyond their training distributions. To characterize and better understand distribution shifts in MLFFs, we conduct diagnostic experiments on chemical datasets, revealing common shifts that pose significant challenges even for large foundation models trained on extensive data. Based on these observations, we hypothesize that current supervised training methods inadequately regularize MLFFs, resulting in overfitting and poor learned representations of out-of-distribution systems. We then propose two new methods as initial steps toward mitigating distribution shifts for MLFFs. Our methods focus on test-time refinement strategies that incur minimal computational cost and do not use expensive ab initio reference labels. The first strategy, based on spectral graph theory, modifies the edges of test graphs to align with graph structures seen during training. The second strategy improves representations for out-of-distribution systems at test time by taking gradient steps on an auxiliary objective, such as a cheap physical prior. Our test-time refinement strategies significantly reduce errors on out-of-distribution systems, suggesting that MLFFs are capable of modeling diverse chemical spaces and can move toward doing so, but are not being effectively trained for it. Our experiments establish clear benchmarks for evaluating the generalization capabilities of the next generation of MLFFs. Our code is available at https://tkreiman.github.io/projects/mlff_distribution_shifts/.
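As an illustration of the first strategy, the sketch below shows one way a test graph's edges could be adjusted to resemble training graphs. It is not the authors' implementation. It assumes that graphs are built with a distance cutoff, that "graph structure" is summarized by a histogram of normalized Laplacian eigenvalues, and that the cutoff is chosen from a small candidate list by an L1 histogram distance. All function names, the histogram distance, and the candidate cutoffs are hypothetical choices made for this example.

```python
import numpy as np

def radius_graph_adjacency(positions, cutoff):
    """Binary adjacency: atoms i != j are connected when closer than `cutoff`."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    adj = (dists < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj

def normalized_laplacian_spectrum(adj):
    """Sorted eigenvalues of L = I - D^{-1/2} A D^{-1/2}, clipped to [0, 2]."""
    deg = adj.sum(axis=1)
    # Isolated atoms (degree 0) get a zero scaling factor, so their
    # diagonal entry in L stays at 1 (a convention assumed here).
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lap = np.eye(len(adj)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    # Clip away round-off so every eigenvalue lands inside the histogram range.
    return np.clip(np.linalg.eigvalsh(lap), 0.0, 2.0)

def spectrum_histogram(eigvals, bins=20):
    """Normalized histogram of eigenvalues over [0, 2]."""
    hist, _ = np.histogram(eigvals, bins=bins, range=(0.0, 2.0))
    return hist / hist.sum()

def pick_cutoff(test_positions, reference_hist, candidate_cutoffs, bins=20):
    """Return the candidate cutoff whose spectrum is closest to the reference."""
    best_cutoff, best_gap = None, np.inf
    for cutoff in candidate_cutoffs:
        adj = radius_graph_adjacency(test_positions, cutoff)
        hist = spectrum_histogram(normalized_laplacian_spectrum(adj), bins)
        gap = np.abs(hist - reference_hist).sum()  # L1 distance (assumed metric)
        if gap < best_gap:
            best_cutoff, best_gap = cutoff, gap
    return best_cutoff, best_gap

# Toy usage: the "test" structure is the training structure dilated by 2x,
# so a fixed training cutoff of 1.5 would give it a much sparser graph.
rng = np.random.default_rng(0)
train_positions = rng.uniform(0.0, 4.0, size=(12, 3))
reference_hist = spectrum_histogram(
    normalized_laplacian_spectrum(radius_graph_adjacency(train_positions, 1.5))
)
test_positions = 2.0 * train_positions
candidates = [1.5, 3.0, 4.5]
best_cutoff, best_gap = pick_cutoff(test_positions, reference_hist, candidates)
```

In this toy setup, a candidate cutoff of 3.0 on the dilated structure reproduces the training connectivity exactly, so the selected cutoff reaches a spectral gap of zero. No model weights are updated and no reference labels are needed, which matches the low-cost, label-free setting described above.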
