Improving Chemical Understanding of LLMs via SMILES Parsing

Abstract

Large language models (LLMs) are increasingly recognized as powerful tools for scientific discovery, particularly in molecular science. A fundamental requirement for these models is the ability to accurately understand molecular structures, which are commonly encoded in the SMILES representation. However, current LLMs struggle to interpret SMILES and fail even at basic tasks such as counting the rings in a molecule. To address this limitation, we introduce CLEANMOL, a novel framework that formulates SMILES parsing as a suite of clean, deterministic tasks explicitly designed to promote graph-level molecular comprehension. These tasks range from subgraph matching to global graph matching, providing structured supervision aligned with molecular structural properties. We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks. Our results show that CLEANMOL not only enhances structural comprehension but also achieves the best performance, or is competitive with the baseline, on the Mol-Instructions benchmark.
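
To illustrate what a clean, deterministic SMILES parsing task can look like, the sketch below computes a ring count directly from a SMILES string. It is not CLEANMOL's implementation. The function name count_rings is hypothetical, the input is assumed to be a valid SMILES string, and "ring count" is taken to mean the number of ring-closure bonds, which equals the cyclomatic number of the molecular graph. In practice a cheminformatics toolkit would likely be used to produce such labels.

```python
def count_rings(smiles: str) -> int:
    """Count rings as the number of ring-closure bonds in a SMILES string.

    Hypothetical helper for illustration; assumes the input is valid SMILES.
    """
    open_labels = set()  # ring-closure labels seen once and not yet closed
    rings = 0
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms: digits inside them are isotopes,
            # hydrogen counts, charges, or atom classes, not ring closures.
            i = smiles.index("]", i) + 1
            continue
        if ch == "%":
            # Two-digit ring-closure label, e.g. %12.
            label = smiles[i + 1 : i + 3]
            i += 3
        elif ch.isdigit():
            # Single-digit ring-closure label.
            label = ch
            i += 1
        else:
            # Atom symbol, bond symbol, or branch parenthesis.
            i += 1
            continue
        if label in open_labels:
            # Second occurrence closes the ring bond.
            open_labels.remove(label)
            rings += 1
        else:
            open_labels.add(label)
    return rings


print(count_rings("c1ccccc1"))        # benzene
print(count_rings("c1ccc2ccccc2c1"))  # naphthalene
print(count_rings("CCO"))             # ethanol
```

Because the answer follows mechanically from the string, a label of this kind can be generated at scale without human annotation, which is the property the abstract refers to as deterministic.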

