Improving Chemical Understanding of LLMs via SMILES Parsing

Abstract

Large language models (LLMs) are increasingly recognized as powerful tools for scientific discovery, particularly in molecular science. A fundamental requirement for these models is the ability to accurately understand molecular structures, which are commonly encoded as SMILES strings. However, current LLMs struggle to interpret SMILES and fail even at basic tasks such as counting the rings in a molecule. To address this limitation, we introduce CLEANMOL, a novel framework that formulates SMILES parsing as a suite of clean and deterministic tasks explicitly designed to promote graph-level molecular comprehension. These tasks range from subgraph matching to global graph matching and provide structured supervision aligned with molecular structural properties. We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks. Our results show that CLEANMOL not only enhances structural comprehension but also matches or outperforms the baseline on the Mol-Instructions benchmark.
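
The ring-counting task mentioned in the abstract has a deterministic answer that can be read off the SMILES string itself. The following Python function is a minimal sketch of that idea and is not the CLEANMOL implementation; the name `count_ring_closures` is hypothetical. It counts matched ring-closure labels, which equals the number of independent cycles in the molecular graph. It assumes a syntactically valid SMILES string in which no ring-closure label is used to bond across "."-separated fragments.

```python
def count_ring_closures(smiles: str) -> int:
    """Count matched ring-closure labels in a SMILES string.

    Assumption (illustrative only): the input is valid SMILES and no
    ring-closure label joins two '.'-separated fragments. Under that
    assumption the result is the number of independent cycles.
    """
    open_labels = set()  # ring-closure labels seen once and not yet closed
    rings = 0
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms: digits inside are isotopes, H counts, or charges.
            end = smiles.find("]", i)
            if end == -1:
                raise ValueError("unclosed bracket atom")
            i = end + 1
            continue
        if ch == "%":
            # Two-digit ring-closure label, written as %nn.
            text = smiles[i + 1:i + 3]
            if len(text) != 2 or not text.isdigit():
                raise ValueError("malformed %nn ring-closure label")
            label = int(text)
            i += 3
        elif ch in "0123456789":
            # Single-digit ring-closure label.
            label = int(ch)
            i += 1
        else:
            # Atoms, bonds, branches, and dots do not affect the count.
            i += 1
            continue
        if label in open_labels:
            # Second occurrence closes the ring bond.
            open_labels.remove(label)
            rings += 1
        else:
            open_labels.add(label)
    if open_labels:
        raise ValueError("unmatched ring-closure label")
    return rings
```

For example, `count_ring_closures("c1ccccc1")` (benzene) returns 1 and `count_ring_closures("c1ccc2ccccc2c1")` (naphthalene) returns 2. For polycyclic cages this count is the cycle rank of the graph, which may differ from ring counts produced by other ring-perception conventions.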

