

BERTology of Molecular Property Prediction

March 13, 2026
Authors: Mohammad Mostafanejad, Paul Saxe, T. Daniel Crawford
cs.AI

Abstract

Chemical language models (CLMs) have emerged as promising competitors to popular classical machine learning models for molecular property prediction (MPP) tasks. However, an increasing number of studies have reported inconsistent and contradictory results for the performance of CLMs across various MPP benchmark tasks. In this study, we conduct and analyze hundreds of meticulously controlled experiments to systematically investigate the effects of various factors, such as dataset size, model size, and standardization, on the pre-training and fine-tuning performance of CLMs for MPP. In the absence of well-established scaling laws for encoder-only masked language models, our aim is to provide comprehensive numerical evidence and a deeper understanding of the underlying mechanisms affecting the performance of CLMs for MPP tasks, some of which appear to be entirely overlooked in the literature.