oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning
October 9, 2025
Authors: Ruiling Xu, Yifan Zhang, Qingyun Wang, Carl Edwards, Heng Ji
cs.AI
Abstract
Organic reaction mechanisms are the stepwise elementary reactions by which
reactants form intermediates and products, and are fundamental to understanding
chemical reactivity and designing new molecules and reactions. Although large
language models (LLMs) have shown promise in understanding chemical tasks such
as synthesis design, it is unclear to what extent this reflects genuine
chemical reasoning capabilities, i.e., the ability to generate valid
intermediates, maintain chemical consistency, and follow logically coherent
multi-step pathways. We address this by introducing oMeBench, the first
large-scale, expert-curated benchmark for organic mechanism reasoning. It
comprises over 10,000 annotated mechanistic steps with
intermediates, type labels, and difficulty ratings. Furthermore, to evaluate
LLM capability more precisely and enable fine-grained scoring, we propose oMeS,
a dynamic evaluation framework that combines step-level logic and chemical
similarity. We analyze the performance of state-of-the-art LLMs, and our
results show that although current models display promising chemical intuition,
they struggle with correct and consistent multi-step reasoning. Notably, we
find that using a prompting strategy and fine-tuning a specialist model on our
proposed dataset increases performance by 50% over the leading closed-source
model. We hope that oMeBench will serve as a rigorous foundation for advancing
AI systems toward genuine chemical reasoning.