BERTology of Molecular Property Prediction

Abstract

Chemical language models (CLMs) have emerged as promising competitors to popular classical machine learning models for molecular property prediction (MPP) tasks. However, a growing number of studies report inconsistent and contradictory results for the performance of CLMs across various MPP benchmark tasks. In this study, we conduct and analyze hundreds of carefully controlled experiments to systematically investigate how factors such as dataset size, model size, and standardization affect the pre-training and fine-tuning performance of CLMs for MPP. In the absence of well-established scaling laws for encoder-only masked language models, our aim is to provide comprehensive numerical evidence and a deeper understanding of the underlying mechanisms that affect the performance of CLMs on MPP tasks, some of which appear to be entirely overlooked in the literature.
