
Towards Data-Efficient Pretraining for Atomic Property Prediction

February 16, 2025
作者: Yasir Ghunaim, Hasan Abed Al Kader Hammoud, Bernard Ghanem
cs.AI

Abstract

This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected, task-relevant dataset can match or even surpass large-scale pretraining, while using as little as 1/24th of the computational cost. We introduce the Chemical Similarity Index (CSI), a novel metric for molecular graphs, inspired by computer vision's Fréchet Inception Distance, that quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most relevant dataset with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently outperform those pretrained on massive, mixed datasets such as JMP, even when those larger datasets include the relevant dataset. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data poorly aligns with the task at hand. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.
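The abstract does not give the exact CSI formula, but since it is described as inspired by the Fréchet Inception Distance, the underlying computation is presumably the Fréchet distance between Gaussians fitted to feature embeddings of the two datasets. The sketch below illustrates that FID-style distance; the function name, the use of generic embedding arrays in place of molecular-graph features, and the choice of backbone for extracting them are all assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """FID-style distance between two feature sets (sketch, not the paper's CSI).

    Each input is an (n_samples, dim) array of embeddings, e.g. features of
    molecular graphs from a pretrained encoder (hypothetical setup).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which we discard.
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(
        np.sum((mu_a - mu_b) ** 2)
        + np.trace(cov_a + cov_b - 2.0 * covmean)
    )
```

Under this reading, selecting the upstream dataset with minimal CSI amounts to computing this distance between each candidate pretraining set and the downstream task's embeddings, then picking the smallest.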

