

Improving Chemical Understanding of LLMs via SMILES Parsing

May 22, 2025
作者: Yunhui Jang, Jaehyung Kim, Sungsoo Ahn
cs.AI

Abstract

Large language models (LLMs) are increasingly recognized as powerful tools for scientific discovery, particularly in molecular science. A fundamental requirement for these models is the ability to accurately understand molecular structures, commonly encoded in the SMILES representation. However, current LLMs struggle to interpret SMILES, failing even at basic tasks such as counting molecular rings. To address this limitation, we introduce CLEANMOL, a novel framework that formulates SMILES parsing as a suite of clean, deterministic tasks explicitly designed to promote graph-level molecular comprehension. These tasks range from subgraph matching to global graph matching, providing structured supervision aligned with molecular structural properties. We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks. Our results show that CLEANMOL not only enhances structural comprehension but also achieves the best results on, or remains competitive with baselines on, the Mol-Instructions benchmark.
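
To make the idea of a "clean and deterministic" SMILES-parsing task concrete, the sketch below shows how one such task (ring counting, the example the abstract cites) could be turned into a supervised (prompt, answer) pair. This is an illustration only, not the authors' implementation: the use of RDKit as the reference parser, the function name `ring_count_example`, and the prompt wording are assumptions.

```python
# A minimal sketch (not the paper's code) of one deterministic SMILES-parsing
# task: counting rings in a molecule. RDKit is assumed here as the toolkit
# that produces the ground-truth label used for supervision.
from rdkit import Chem

def ring_count_example(smiles: str) -> dict:
    """Build a (prompt, answer) pair asking an LLM to count rings in a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    # The label comes from a deterministic graph computation, so the
    # supervision signal is unambiguous.
    num_rings = mol.GetRingInfo().NumRings()
    return {
        "prompt": f"How many rings does the molecule {smiles} contain?",
        "answer": str(num_rings),
    }

if __name__ == "__main__":
    # Caffeine has two fused rings; the expected answer is "2".
    print(ring_count_example("CN1C=NC2=C1C(=O)N(C(=O)N2C)C"))
```

Other tasks described in the abstract, such as subgraph matching or global graph matching, could be generated in the same way: compute the answer deterministically from the molecular graph and pair it with a natural-language prompt.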
