Improving Chemical Understanding of LLMs via SMILES Parsing
May 22, 2025
作者: Yunhui Jang, Jaehyung Kim, Sungsoo Ahn
cs.AI
Abstract
Large language models (LLMs) are increasingly recognized as powerful tools
for scientific discovery, particularly in molecular science. A fundamental
requirement for these models is the ability to accurately understand molecular
structures, commonly encoded in the SMILES representation. However, current
LLMs struggle to interpret SMILES, even failing to carry out basic tasks such
as counting molecular rings. To address this limitation, we introduce CLEANMOL,
a novel framework that formulates SMILES parsing into a suite of clean and
deterministic tasks explicitly designed to promote graph-level molecular
comprehension. These tasks span from subgraph matching to global graph
matching, providing structured supervision aligned with molecular structural
properties. We construct a molecular pretraining dataset with adaptive
difficulty scoring and pre-train open-source LLMs on these tasks. Our results
show that CLEANMOL not only enhances structural comprehension but also achieves
the best or competitive results relative to baselines on the Mol-Instructions
benchmark.
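The ring-counting task cited above has a clean, deterministic ground truth: in SMILES, each paired ring-closure label (a digit, or `%nn` for two-digit labels) closes exactly one cycle, so the ring count equals the number of closure pairs. The sketch below is a toy illustration of that property, not code from the paper; it only handles the closure-label syntax and skips bracket atoms such as `[13C]` so isotope or charge digits are not miscounted.

```python
def count_rings(smiles: str) -> int:
    """Count rings in a SMILES string via paired ring-closure labels.

    Each closure label appears exactly twice (open and close), and each
    pair contributes one ring bond, i.e. one independent cycle.
    """
    closures = 0
    in_bracket = False  # digits inside [...] are isotopes/charges, not closures
    i = 0
    while i < len(smiles):
        c = smiles[i]
        if c == '[':
            in_bracket = True
        elif c == ']':
            in_bracket = False
        elif not in_bracket:
            if c == '%':        # two-digit closure label, e.g. %12
                closures += 1
                i += 2          # skip the two label digits
            elif c.isdigit():   # single-digit closure label
                closures += 1
        i += 1
    return closures // 2        # two label occurrences per ring


print(count_rings("c1ccccc1"))      # benzene: 1 ring
print(count_rings("C1CC2CCC1CC2"))  # bicyclic: 2 rings
print(count_rings("CCO"))           # ethanol: 0 rings
```

Because the answer is computable by this kind of linear scan, supervision for the task can be generated at scale without noisy labels, which is the sense in which the paper's tasks are "clean and deterministic."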