

Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

January 29, 2026
作者: Shuangshuang Ying, Zheyu Wang, Yunjian Peng, Jin Chen, Yuhao Wu, Hongbin Lin, Dingyu He, Siyi Liu, Gengchen Yu, YinZhu Piao, Yuchen Wu, Xin Gui, Zhongyuan Peng, Xin Li, Xeron Du, Libo Qin, YiXin Cao, Ge Zhang, Stephen Huang
cs.AI

Abstract

Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipelines, where reasoning is confounded with retrieval and toolchain choices, and the signal is further contaminated by parametric memorization and open-web volatility. We introduce DeR2, a controlled deep-research sandbox that isolates document-grounded reasoning while preserving the core difficulties of deep search: multi-step synthesis, denoising, and drawing evidence-based conclusions. DeR2 decouples evidence access from reasoning via four regimes: Instruction-only; Concepts (gold concepts without documents); Related-only (only relevant documents); and Full-set (relevant documents plus topically related distractors). The resulting regime gaps are interpretable: they operationalize retrieval loss versus reasoning loss and enable fine-grained error attribution. To prevent parametric leakage, we apply a two-phase validation that requires parametric failure without evidence while ensuring oracle-concept solvability. To ensure reproducibility, each instance provides a frozen document library (drawn from 2023-2025 theoretical papers) with expert-annotated concepts and validated rationales. Experiments across a diverse set of state-of-the-art foundation models reveal substantial variation and significant headroom: some models exhibit mode-switch fragility, performing worse with the Full-set than with Instruction-only, while others show structural concept misuse, correctly naming concepts but failing to execute them as procedures.
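The four-regime design can be made concrete with a small sketch. The gap definitions below are illustrative assumptions, not the paper's exact formulas: we assume a per-regime accuracy score and derive a "retrieval loss" (the cost of distractors, Related-only minus Full-set) and a "reasoning loss" (the shortfall of document-grounded reasoning against oracle concepts, Concepts minus Related-only). The function name `regime_gaps` and the example scores are hypothetical.

```python
# Illustrative sketch (assumptions, not the paper's definitions):
# derive interpretable gaps from per-regime accuracy scores.

REGIMES = ("instruction_only", "concepts", "related_only", "full_set")

def regime_gaps(scores: dict) -> dict:
    """Compute hypothetical regime gaps from accuracies in [0, 1].

    retrieval_loss: drop caused by distractors (Related-only -> Full-set)
    reasoning_loss: shortfall vs. oracle concepts (Concepts -> Related-only)
    evidence_gain:  benefit of documents over instructions alone
    """
    missing = [r for r in REGIMES if r not in scores]
    if missing:
        raise ValueError(f"missing regime scores: {missing}")
    return {
        "retrieval_loss": scores["related_only"] - scores["full_set"],
        "reasoning_loss": scores["concepts"] - scores["related_only"],
        "evidence_gain": scores["related_only"] - scores["instruction_only"],
    }

# Hypothetical scores for one model across the four regimes.
gaps = regime_gaps({
    "instruction_only": 0.20,
    "concepts": 0.70,
    "related_only": 0.55,
    "full_set": 0.40,
})
print(gaps)
```

Under this framing, "mode-switch fragility" would show up as a Full-set score below the Instruction-only score (a retrieval loss exceeding the evidence gain), while "structural concept misuse" would appear as a large reasoning loss even when retrieval loss is small.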