CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation
April 7, 2026
Authors: Siddharth Jain, Venkat Narayan Vedam
cs.AI
Abstract
As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly addresses the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow, observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement to RAG evaluation.
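To make the intervention scheme concrete, here is a minimal sketch of the three operators and a correctness-only utility delta. All function names and the toy answer model are illustrative assumptions for exposition, not the paper's actual implementation, which also scores grounding faithfulness, confidence error, and trace divergence.

```python
from typing import Callable, List

def remove(evidence: List[str], i: int) -> List[str]:
    # REMOVE: drop the i-th evidence item from the retrieved context.
    return evidence[:i] + evidence[i + 1:]

def replace(evidence: List[str], i: int, distractor: str = "") -> List[str]:
    # REPLACE: swap the i-th item for an unrelated distractor passage.
    out = list(evidence)
    out[i] = distractor
    return out

def duplicate(evidence: List[str], i: int) -> List[str]:
    # DUPLICATE: append a second copy of the i-th item.
    return list(evidence) + [evidence[i]]

def utility_delta(answer_fn: Callable[[List[str]], str],
                  evidence: List[str], i: int,
                  op: Callable[..., List[str]],
                  gold: str, **op_kwargs) -> float:
    """Change on one utility axis (here: exact-match correctness)
    between the original context and the perturbed one.
    A positive delta means the intervention hurt correctness."""
    base = float(answer_fn(evidence) == gold)
    perturbed = float(answer_fn(op(evidence, i, **op_kwargs)) == gold)
    return base - perturbed

# Toy "model": answers correctly only if the key fact is in context.
answer_fn = lambda ev: "Paris" if any("capital" in e for e in ev) else "?"
ev = ["France's capital is Paris.", "The Seine flows through it."]

print(utility_delta(answer_fn, ev, 0, remove, "Paris"))      # 1.0: removal hurts
print(utility_delta(answer_fn, ev, 1, duplicate, "Paris"))   # 0.0: answer-redundant
```

Under this toy setup, removing the supporting fact flips the answer (delta 1.0) while duplicating a redundant item leaves correctness unchanged (delta 0.0), mirroring the abstract's pattern that REMOVE/REPLACE are harmful while DUPLICATE is often answer-redundant. The abstract's two-support ablation corresponds to applying REMOVE at two indices jointly and comparing against the single-removal deltas.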