Understanding and Mitigating Distribution Shifts For Machine Learning Force Fields

Abstract

Machine Learning Force Fields (MLFFs) are a promising alternative to expensive ab initio quantum mechanical molecular simulations. Given the diversity of chemical spaces of interest and the cost of generating new data, it is important to understand how MLFFs generalize beyond their training distributions. To characterize and better understand distribution shifts in MLFFs, we conduct diagnostic experiments on chemical datasets, revealing common shifts that pose significant challenges, even for large foundation models trained on extensive data. Based on these observations, we hypothesize that current supervised training methods inadequately regularize MLFFs, resulting in overfitting and poor learned representations of out-of-distribution systems. We then propose two new methods as initial steps towards mitigating distribution shifts for MLFFs. Our methods focus on test-time refinement strategies that incur minimal computational cost and do not use expensive ab initio reference labels. The first strategy, based on spectral graph theory, modifies the edges of test graphs to align with graph structures seen during training. Our second strategy improves representations for out-of-distribution systems at test time by taking gradient steps using an auxiliary objective, such as a cheap physical prior. Our test-time refinement strategies significantly reduce errors on out-of-distribution systems, suggesting that MLFFs are capable of, and can move towards, modeling diverse chemical spaces, but are not being effectively trained to do so. Our experiments establish clear benchmarks for evaluating the generalization capabilities of the next generation of MLFFs. Our code is available at https://tkreiman.github.io/projects/mlff_distribution_shifts/.
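
To make the idea behind the first strategy more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a molecular graph is built by connecting atoms closer than a cutoff radius, and it picks, from a few candidate cutoffs, the one whose normalized Laplacian spectrum is closest to a reference spectrum taken from a training-like structure. The function names, the histogram-based comparison, and the toy chain geometry are all assumptions made for illustration.

```python
import numpy as np


def radius_graph_adjacency(positions, cutoff):
    """Adjacency matrix linking atoms closer than `cutoff` (no self-loops)."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    adj = (dists < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj


def spectrum_histogram(adj, bins=10):
    """Normalized histogram of the normalized-Laplacian eigenvalues in [0, 2]."""
    deg = adj.sum(axis=1)
    connected = deg > 0
    # Isolated nodes get a zero row and column (eigenvalue 0 in this convention).
    inv_sqrt = np.where(connected, 1.0 / np.sqrt(np.maximum(deg, 1.0)), 0.0)
    lap = np.diag(connected.astype(float)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    eig = np.clip(np.linalg.eigvalsh(lap), 0.0, 2.0)
    hist, _ = np.histogram(eig, bins=bins, range=(0.0, 2.0))
    return hist / hist.sum()


def pick_cutoff(positions, reference_hist, candidate_cutoffs):
    """Return the candidate cutoff whose spectrum histogram best matches the reference."""
    def mismatch(cutoff):
        hist = spectrum_histogram(radius_graph_adjacency(positions, cutoff))
        return np.abs(hist - reference_hist).sum()
    return min(candidate_cutoffs, key=mismatch)


# Hypothetical toy data: a 6-atom chain with unit spacing stands in for a
# training structure; the test structure is the same chain stretched by 2x.
train_positions = np.array([[float(i), 0.0, 0.0] for i in range(6)])
test_positions = 2.0 * train_positions
train_cutoff = 1.5

reference_hist = spectrum_histogram(radius_graph_adjacency(train_positions, train_cutoff))

# With the training cutoff, the stretched chain has no edges at all.
edges_with_train_cutoff = radius_graph_adjacency(test_positions, train_cutoff).sum()

best_cutoff = pick_cutoff(test_positions, reference_hist, [1.5, 3.0, 5.0])
print(edges_with_train_cutoff, best_cutoff)
```

In this toy case the training cutoff leaves the stretched chain fully disconnected, while the selected cutoff restores the connectivity pattern of the reference chain. A real system would need a more careful spectral comparison and a richer set of reference graphs.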
