CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation
April 7, 2026
Authors: Siddharth Jain, Venkat Narayan Vedam
cs.AI
Abstract
As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly targets the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.
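The three intervention operators described above can be illustrated with a minimal sketch. This is not the authors' implementation; the function names and the flat list-of-strings representation of retrieved evidence are assumptions for illustration only.

```python
# Illustrative sketch of the CUE-R-style intervention operators (hypothetical
# names): each takes the retrieved evidence list and an item index, and returns
# a perturbed copy of the list without mutating the original.

def remove(evidence: list[str], i: int) -> list[str]:
    """REMOVE: drop evidence item i from the retrieved set."""
    return evidence[:i] + evidence[i + 1:]

def replace(evidence: list[str], i: int, distractor: str) -> list[str]:
    """REPLACE: substitute evidence item i with an unrelated distractor passage."""
    return evidence[:i] + [distractor] + evidence[i + 1:]

def duplicate(evidence: list[str], i: int) -> list[str]:
    """DUPLICATE: repeat evidence item i, probing sensitivity to redundancy."""
    return evidence[:i + 1] + [evidence[i]] + evidence[i + 1:]

evidence = ["passage_a", "passage_b", "passage_c"]
print(remove(evidence, 1))               # ['passage_a', 'passage_c']
print(replace(evidence, 1, "distractor"))  # ['passage_a', 'distractor', 'passage_c']
print(duplicate(evidence, 1))            # ['passage_a', 'passage_b', 'passage_b', 'passage_c']
```

In the framework's evaluation loop, each perturbed evidence list would be fed back to the RAG pipeline and the resulting answer compared against the unperturbed run along the three utility axes (correctness, grounding, confidence error) plus the trace-divergence signal.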