Answer Presence Drives RAG Rewrite Gains

Abstract

Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before passing them to a smaller reader, lifting F1 by tens of points on multi-hop benchmarks. This gain is typically credited to improved evidence quality. Using a controlled intervention audit, we ask whether the lift is instead causally driven by the gold answer string appearing in the rewritten context, rather than by curation per se. For each rewritten context we re-run the reader after one of four controlled edits to the compile output: removing the gold answer span, replacing a length-matched random non-answer span (placebo), or injecting the gold answer into rewrites where it was absent (either at the prefix or at a midpoint sentence boundary). The audit covers twelve completed (cell, baseline) intervention runs spanning three reader families (Qwen2.5-7B, Qwen3.5-35B, GLM-4.7), two datasets (HotpotQA, 2WikiMultihopQA), and three compiler arrangements (MA-only, MB-only, MA+verify). On paired answer-in-compile strata, removing the gold answer drops reader F1 by 28 to 64 points beyond the length-matched placebo. Prepending the gold answer to rewrites that lacked it raises F1 by +0.7 to +9.7 points in 10 of 12 (cell, baseline) combinations. A companion five-sentinel audit shows that the conventional single-[MASK] probe is itself sentinel-fragile: on 2Wiki it reports a +4.12 F1 "non-leakage residual" that flips to between -3.33 and -7.81 F1 under four alternative sentinels and fails an equivalence test for three of those four (1/4 pass). We do not propose a new rewriter or mitigation; we release the intervention runner and the sentinel panel so that other rewriter-gain claims can be tested against the same standard.
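The four controlled edits and the placebo-adjusted effect can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released intervention runner: all function names are hypothetical, spans are matched at the character level and case-insensitively, the placebo is implemented as deletion of a length-matched span that avoids the answer, and sentences are split with a simple punctuation regex.

```python
import random
import re


def gold_spans(context: str, answer: str):
    """Character spans of every case-insensitive occurrence of the answer."""
    return [m.span() for m in re.finditer(re.escape(answer), context, re.IGNORECASE)]


def remove_gold(context: str, answer: str) -> str:
    """Edit 1: delete every occurrence of the gold answer string."""
    return re.sub(re.escape(answer), "", context, flags=re.IGNORECASE)


def remove_placebo(context: str, answer: str, rng: random.Random) -> str:
    """Edit 2 (assumed form): delete a random span of the same length as the
    answer that does not overlap any gold occurrence."""
    n = len(answer)
    spans = gold_spans(context, answer)
    starts = [
        s for s in range(len(context) - n + 1)
        if all(s + n <= a or s >= b for a, b in spans)
    ]
    if not starts:  # no room for a non-overlapping span
        return context
    s = rng.choice(starts)
    return context[:s] + context[s + n:]


def inject_prefix(context: str, answer: str) -> str:
    """Edit 3: prepend the gold answer as its own sentence."""
    return f"{answer}. {context}"


def inject_mid(context: str, answer: str) -> str:
    """Edit 4: insert the gold answer at the midpoint sentence boundary."""
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    mid = len(sentences) // 2
    sentences[mid:mid] = [f"{answer}."]
    return " ".join(sentences)


def placebo_adjusted_drop(f1_placebo, f1_removed):
    """Mean paired F1 drop from gold removal beyond the placebo edit.
    Inputs are per-example reader F1 scores on the same examples."""
    diffs = [p - r for p, r in zip(f1_placebo, f1_removed)]
    return sum(diffs) / len(diffs)
```

In use, each edited context would be passed to the same reader as the unedited rewrite, with F1 compared on paired examples. Subtracting the placebo condition separates the effect of losing the answer string from the effect of merely deleting text of that length.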