
Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

August 26, 2025
Authors: Alan Li, Yixin Liu, Arpan Sarkar, Doug Downey, Arman Cohan
cs.AI

Abstract

Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge. Finally, we conduct a lightweight analysis, comparing our science-focused data composition with concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline for scientific reasoning.
PDF · August 27, 2025