BERTology of Molecular Property Prediction

Abstract

Chemical language models (CLMs) have emerged as promising competitors to popular classical machine learning models for molecular property prediction (MPP) tasks. However, an increasing number of studies have reported inconsistent and contradictory results for the performance of CLMs across various MPP benchmark tasks. In this study, we conduct and analyze hundreds of meticulously controlled experiments to systematically investigate the effects of various factors, such as dataset size, model size, and standardization, on the pre-training and fine-tuning performance of CLMs for MPP. In the absence of well-established scaling laws for encoder-only masked language models, our aim is to provide comprehensive numerical evidence and a deeper understanding of the underlying mechanisms affecting the performance of CLMs for MPP tasks, some of which appear to be entirely overlooked in the literature.
