Answer Presence Drives RAG Rewriting Gains

Abstract

Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before a smaller reader, lifting F1 by tens of points on multi-hop benchmarks; this gain is typically credited to improved evidence quality. Using a controlled intervention audit, we ask whether that lift is causally driven by the gold answer string appearing in the rewritten context rather than by curation per se. For each rewritten context we re-run the reader after one of four controlled edits to the compile output: removing the gold answer span; replacing a length-matched random non-answer span (placebo); or injecting the gold answer into rewrites where it was absent, either at the prefix or at a midpoint sentence boundary. Across twelve completed (cell, baseline) intervention runs spanning three reader families (Qwen2.5-7B, Qwen3.5-35B, GLM-4.7), two datasets (HotpotQA, 2WikiMultihopQA), and three compiler arrangements (MA-only, MB-only, MA+verify), removing the gold answer drops reader F1 by 28 to 64 points beyond the length-matched placebo on paired answer-in-compile strata, and prepending the gold answer to rewrites that lacked it raises F1 by +0.7 to +9.7 points in 10 of 12 (cell, baseline) combinations. A companion five-sentinel audit shows that the conventional single-[MASK] probe is itself sentinel-fragile: on 2Wiki it reports a +4.12 F1 "non-leakage residual" that flips to between -3.33 and -7.81 F1 under four alternative sentinels and fails an equivalence test for three of those four (1/4 pass). We do not propose a new rewriter or mitigation; we release the intervention runner and the sentinel panel so that other rewriter-gain claims can be tested against the same standard.
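The four controlled edits described above can be illustrated with a minimal sketch. This is not the released intervention runner. The function names, the use of a `[MASK]` sentinel for removed spans, word-level length matching, the regex-based sentence splitter, and the simplified token-level F1 are all assumptions made for illustration.

```python
import random
import re
from collections import Counter

# Assumed placeholder for removed spans; the abstract notes that
# results can be sensitive to the choice of sentinel.
SENTINEL = "[MASK]"


def remove_gold(context: str, gold: str, sentinel: str = SENTINEL) -> str:
    """Replace every case-insensitive occurrence of the gold string."""
    return re.sub(re.escape(gold), lambda _m: sentinel, context,
                  flags=re.IGNORECASE)


def placebo_edit(context: str, gold: str, rng: random.Random,
                 sentinel: str = SENTINEL) -> str:
    """Replace a random span of the same word count that avoids the gold."""
    words = list(re.finditer(r"\S+", context))
    gold_spans = [m.span() for m in
                  re.finditer(re.escape(gold), context, flags=re.IGNORECASE)]

    def overlaps_gold(w: re.Match) -> bool:
        return any(w.start() < end and start < w.end()
                   for start, end in gold_spans)

    n = max(1, len(gold.split()))
    starts = [i for i in range(len(words) - n + 1)
              if not any(overlaps_gold(w) for w in words[i:i + n])]
    if not starts:
        return context  # no eligible span; leave the context unchanged
    i = rng.choice(starts)
    a, b = words[i].start(), words[i + n - 1].end()
    return context[:a] + sentinel + context[b:]


def inject_prefix(context: str, gold: str) -> str:
    """Prepend the gold answer as its own short sentence."""
    return f"{gold}. {context}"


def inject_midpoint(context: str, gold: str) -> str:
    """Insert the gold answer at the sentence boundary nearest the middle."""
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    mid = max(1, len(sentences) // 2)
    return " ".join(sentences[:mid] + [f"{gold}."] + sentences[mid:])


def token_f1(prediction: str, gold: str) -> float:
    """Simplified token-overlap F1 (lowercased, whitespace-tokenized)."""
    p, g = prediction.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, the drop "beyond the length-matched placebo" for one example would be `token_f1(reader(placebo_ctx), gold) - token_f1(reader(removed_ctx), gold)`, where `reader` is a hypothetical callable wrapping the reader model, averaged over the paired answer-in-compile stratum.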