Improving Chemical Understanding of Large Language Models via SMILES Parsing

Abstract

Large language models (LLMs) are increasingly recognized as powerful tools for scientific discovery, particularly in molecular science. A fundamental requirement for these models is the ability to accurately understand molecular structures, which are commonly encoded as SMILES strings. However, current LLMs struggle to interpret SMILES and fail even at basic tasks such as counting the rings in a molecule. To address this limitation, we introduce CLEANMOL, a novel framework that formulates SMILES parsing as a suite of clean, deterministic tasks explicitly designed to promote graph-level molecular comprehension. These tasks range from subgraph matching to global graph matching and provide structured supervision aligned with molecular structural properties. We construct a molecular pretraining dataset with adaptive difficulty scoring and pretrain open-source LLMs on these tasks. Our results show that CLEANMOL not only enhances structural comprehension but also achieves the best performance or remains competitive with the baseline on the Mol-Instructions benchmark.
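
To make the idea of a "clean and deterministic" task concrete, the ring-counting example mentioned above can be sketched in a few lines of standard-library Python. This is only an illustration and not the CLEANMOL implementation: the function name `count_rings` is hypothetical, and the sketch assumes a single connected molecule written in valid SMILES with one- or two-digit (`%nn`) ring-closure labels. Under those assumptions, each matched pair of ring-closure labels adds exactly one cycle to the molecular graph, so the number of pairs equals the graph's cyclomatic number. A real data pipeline would more likely rely on a cheminformatics toolkit.

```python
def count_rings(smiles: str) -> int:
    """Count ring-closure bonds in a SMILES string (hypothetical helper).

    Assumes one connected molecule; for such a molecule the count equals
    the cyclomatic number (bonds - atoms + 1) of the molecular graph.
    """
    open_labels = set()  # ring-closure labels seen once and not yet matched
    rings = 0
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Digits inside a bracket atom are isotopes, hydrogen counts,
            # or charges, never ring closures, so skip the whole atom.
            i = smiles.index("]", i) + 1
            continue
        if ch == "%":
            # Two-digit ring-closure label, e.g. %10
            label = smiles[i + 1:i + 3]
            i += 3
        elif ch.isdigit():
            label = ch
            i += 1
        else:
            i += 1
            continue
        if label in open_labels:
            # Second occurrence closes the ring bond.
            open_labels.remove(label)
            rings += 1
        else:
            open_labels.add(label)
    if open_labels:
        raise ValueError(f"unclosed ring labels: {sorted(open_labels)}")
    return rings


if __name__ == "__main__":
    for smi in ["CCO", "c1ccccc1", "c1ccc2ccccc2c1"]:
        print(smi, count_rings(smi))
```

Because the answer follows mechanically from the string, a label produced this way has no annotation noise, which is the property the abstract attributes to its tasks. The sketch does not handle ring-closure labels that join two dot-separated components (where the closure forms a bond rather than a ring).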