Answer Presence Drives RAG Rewriting Gains

Abstract

Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before a smaller reader, lifting F1 by tens of points on multi-hop benchmarks; this gain is typically credited to improved evidence quality. Using a controlled intervention audit, we ask whether that lift is causally driven by the gold answer string appearing in the rewritten context rather than by curation per se. For each rewritten context we re-run the reader after one of four controlled edits to the compile output: removing the gold answer span; replacing a length-matched random non-answer span (placebo); injecting the gold answer at the prefix of a rewrite that lacked it; or injecting it at a midpoint sentence boundary of such a rewrite. Across twelve completed (cell, baseline) intervention runs spanning three reader families (Qwen2.5-7B, Qwen3.5-35B, GLM-4.7), two datasets (HotpotQA, 2WikiMultihopQA), and three compiler arrangements (MA-only, MB-only, MA+verify), removing the gold answer drops reader F1 by 28 to 64 points beyond the length-matched placebo on paired answer-in-compile strata, and prepending the gold answer to rewrites that lacked it raises F1 by 0.7 to 9.7 points in 10 of the 12 (cell, baseline) combinations. A companion five-sentinel audit shows that the conventional single-[MASK] probe is itself sentinel-fragile: on 2Wiki it reports a +4.12 F1 "non-leakage residual" that flips to between -3.33 and -7.81 F1 under four alternative sentinels and fails an equivalence test for three of those four (only 1 of 4 passes). We do not propose a new rewriter or mitigation; we release the intervention runner and the sentinel panel so that other rewriter-gain claims can be tested against the same standard.
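
The four controlled edits described above can be illustrated with a minimal Python sketch. This is not the released intervention runner. The function names, the character-level span matching, the regex sentence splitter, and the reading of the placebo as deleting a length-matched non-answer span are all assumptions made for illustration.

```python
import random
import re


def remove_gold(context: str, gold: str) -> str:
    """Delete every case-insensitive occurrence of the gold answer string."""
    return re.sub(re.escape(gold), "", context, flags=re.IGNORECASE)


def placebo(context: str, gold: str, rng: random.Random) -> str:
    """Delete one random span of len(gold) characters that does not overlap
    any gold occurrence (assumed reading of the length-matched placebo)."""
    n = len(gold)
    gold_spans = [m.span() for m in
                  re.finditer(re.escape(gold), context, flags=re.IGNORECASE)]
    # Candidate start offsets whose window [i, i + n) avoids every gold span.
    starts = [i for i in range(len(context) - n + 1)
              if all(i + n <= s or i >= e for s, e in gold_spans)]
    if not starts:
        return context  # no non-overlapping span of that length exists
    i = rng.choice(starts)
    return context[:i] + context[i + n:]


def inject_prefix(context: str, gold: str) -> str:
    """Prepend the gold answer as its own short sentence."""
    return f"{gold}. {context}"


def inject_mid(context: str, gold: str) -> str:
    """Insert the gold answer at the sentence boundary nearest the midpoint."""
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    mid = len(sentences) // 2
    return " ".join(sentences[:mid] + [f"{gold}."] + sentences[mid:])


if __name__ == "__main__":
    with_gold = "Paris is the capital of France. It lies on the Seine."
    without_gold = "It is the capital of France. It lies on the Seine."
    rng = random.Random(0)  # fixed seed so the placebo edit is reproducible
    print(remove_gold(with_gold, "Paris"))
    print(placebo(with_gold, "Paris", rng))
    print(inject_prefix(without_gold, "Paris"))
    print(inject_mid(without_gold, "Paris"))
```

Under this sketch, the removal effect "beyond placebo" would be read as the reader's F1 on the placebo-edited context minus its F1 on the gold-removed context, computed on the same paired examples.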