Answer Presence Drives RAG Rewrite Gains

Abstract

Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before a smaller reader, lifting F1 by tens of points on multi-hop benchmarks; this gain is typically credited to improved evidence quality. Using a controlled intervention audit, we ask whether the lift is instead causally driven by the gold answer string appearing in the rewritten context, rather than by curation per se. For each rewritten context we re-run the reader after one of four controlled edits to the compile output: (i) removing the gold answer span; (ii) replacing a length-matched random non-answer span (placebo); (iii) injecting the gold answer at the prefix of a rewrite that lacked it; or (iv) injecting it at a midpoint sentence boundary of such a rewrite. Across twelve completed (cell, baseline) intervention runs spanning three reader families (Qwen2.5-7B, Qwen3.5-35B, GLM-4.7), two datasets (HotpotQA, 2WikiMultihopQA), and three compiler arrangements (MA-only, MB-only, MA+verify), removing the gold answer drops reader F1 by 28 to 64 points more than the length-matched placebo on paired answer-in-compile strata, and prepending the gold answer to rewrites that lacked it raises F1 by 0.7 to 9.7 points in 10 of the 12 (cell, baseline) combinations. A companion five-sentinel audit shows that the conventional single-[MASK] probe is itself sentinel-fragile: on 2WikiMultihopQA it reports a +4.12 F1 "non-leakage residual" that flips to between -3.33 and -7.81 F1 under the four alternative sentinels, and it fails an equivalence test for three of those four (1/4 pass). We do not propose a new rewriter or mitigation; we release the intervention runner and the sentinel panel so that other rewriter-gain claims can be tested against the same standard.
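To make the four edits concrete, the following is a minimal Python sketch and not the released intervention runner. All function names, the `fill` parameter, the character-level span matching, the regex sentence splitter, and the "gold answer as a standalone sentence" injection format are assumptions made for illustration. The last helper shows how a removal effect "beyond placebo" could be computed from paired per-example F1 scores, since the unedited F1 cancels in the difference.

```python
import random
import re
from statistics import mean


def remove_gold(context: str, gold: str, fill: str = "") -> str:
    """Replace every case-insensitive occurrence of the gold string with `fill`."""
    return re.sub(re.escape(gold), lambda _: fill, context, flags=re.IGNORECASE)


def placebo(context: str, gold: str, rng: random.Random, fill: str = "") -> str:
    """Replace one random span of len(gold) characters that does not overlap the gold."""
    n = len(gold)
    gold_spans = [
        m.span() for m in re.finditer(re.escape(gold), context, flags=re.IGNORECASE)
    ]
    # Candidate start offsets whose window [s, s + n) avoids every gold span.
    starts = [
        s
        for s in range(len(context) - n + 1)
        if all(s + n <= a or s >= b for a, b in gold_spans)
    ]
    if not starts:
        return context  # no eligible span; leave the context unedited
    s = rng.choice(starts)
    return context[:s] + fill + context[s + n:]


def inject_prefix(context: str, gold: str) -> str:
    """Prepend the gold answer as its own sentence (assumed format)."""
    return f"{gold}. {context}"


def inject_midpoint(context: str, gold: str) -> str:
    """Insert the gold answer at the sentence boundary nearest the middle."""
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    mid = len(sentences) // 2
    return " ".join(sentences[:mid] + [f"{gold}."] + sentences[mid:])


def excess_drop(f1_removed: list[float], f1_placebo: list[float]) -> float:
    """Mean paired F1 drop of gold removal beyond placebo.

    (orig - removed) - (orig - placebo) simplifies to placebo - removed.
    """
    return mean(p - r for r, p in zip(f1_removed, f1_placebo))
```

In this sketch the placebo deletes exactly as many characters as one gold occurrence, so a reader run on the placebo output controls for context shortening while leaving the answer string intact.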