A Mechanistic Study of BERT for Molecular Property Prediction

Abstract

Chemical language models (CLMs) have emerged as promising competitors to classical machine learning models for molecular property prediction (MPP) tasks. However, a growing number of studies report inconsistent and even contradictory results on the performance of CLMs across MPP benchmark tasks. In this study, we conduct and analyze hundreds of carefully controlled experiments to systematically investigate how factors such as dataset size, model size, and standardization affect the pre-training and fine-tuning performance of CLMs for MPP. In the absence of well-established scaling laws for encoder-only masked language models, our aim is to provide comprehensive numerical evidence and a deeper understanding of the underlying mechanisms that affect the performance of CLMs on MPP tasks, some of which appear to be entirely overlooked in the literature.