

Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

August 26, 2025
Authors: Alan Li, Yixin Liu, Arpan Sarkar, Doug Downey, Arman Cohan
cs.AI

Abstract

Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge. Finally, we conduct a lightweight analysis, comparing our science-focused data composition with concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline for scientific reasoning.