oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

October 9, 2025
Authors: Ruiling Xu, Yifan Zhang, Qingyun Wang, Carl Edwards, Heng Ji
cs.AI

Abstract

Organic reaction mechanisms are the sequences of elementary steps by which reactants form intermediates and products, and they are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise on chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for organic mechanism reasoning. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that using a prompting strategy and fine-tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.
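The abstract describes oMeS only as combining step-level logic with chemical similarity and does not give its formulation. As a rough illustration of what scoring of that kind could look like, the sketch below aligns a predicted mechanism against a reference in order and gives partial credit per step. It is not the authors' oMeS: the function names, the step representation, the type labels, the 0.5 weighting, and the use of a plain string ratio in place of a real chemical similarity measure are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def step_similarity(pred, gold, type_weight=0.5):
    """Score one predicted step against one reference step, in [0, 1].

    Each step is a hypothetical (type_label, intermediate_smiles) tuple.
    The string ratio is only a stand-in for a real chemical similarity
    measure (an assumption of this sketch, not the paper's method).
    """
    type_match = 1.0 if pred[0] == gold[0] else 0.0
    structure_sim = SequenceMatcher(None, pred[1], gold[1]).ratio()
    return type_weight * type_match + (1.0 - type_weight) * structure_sim


def mechanism_score(pred_steps, gold_steps):
    """Align two step sequences in order and return a score in [0, 1]."""
    if not pred_steps or not gold_steps:
        return 0.0
    n, m = len(pred_steps), len(gold_steps)
    # best[i][j] = best total similarity aligning pred[:i] with gold[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            matched = best[i - 1][j - 1] + step_similarity(
                pred_steps[i - 1], gold_steps[j - 1]
            )
            # Either pair the two steps, or skip one of them.
            best[i][j] = max(matched, best[i - 1][j], best[i][j - 1])
    # Dividing by the longer sequence penalizes missing and extra steps.
    return best[n][m] / max(n, m)


# Illustrative SN1 hydrolysis of tert-butyl bromide (labels are hypothetical).
gold = [
    ("ionization", "C[C+](C)C"),
    ("nucleophilic_attack", "CC(C)(C)[OH2+]"),
    ("proton_transfer", "CC(C)(C)O"),
]
pred = [gold[0], gold[2]]  # the model skipped the nucleophilic attack

print(mechanism_score(gold, gold))
print(mechanism_score(pred, gold))
```

A prediction that reproduces every reference step scores 1.0, while one that skips a step is capped below that because the total is divided by the longer sequence. A scorer of this shape gives partial credit for chemically close intermediates rather than requiring an exact match.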