RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems

Abstract

Retrieval-Augmented Generation (RAG) improves the recency and factuality of answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. A central feature of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single-hop and multi-hop relations from a customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 400 expert-level, time-sensitive finance, economics, and policy documents and 48,322 questions, whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct, or to recover, when queries, documents, or real-world retrieval results are systematically altered. Our results show that RAG systems are surprisingly vulnerable to perturbations, with document robustness consistently the weakest point regardless of generator size or architecture. RAG systems also consistently show lower robustness on multi-hop queries than on single-hop queries across all domains.
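The abstract does not give the formal definition of RARE-Met, so the following is only an illustrative sketch of the idea of "remaining correct or recovering" under perturbation, not the paper's metric. It assumes each question is answered once on the original input and once on a perturbed input, and that correctness is already judged as a boolean. The names `Trial`, `retention_rate`, and `recovery_rate` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Trial:
    """Hypothetical record: one question answered before and after a perturbation."""
    correct_original: bool   # was the answer correct on the unperturbed input?
    correct_perturbed: bool  # was the answer correct after the perturbation?


def retention_rate(trials: Sequence[Trial]) -> Optional[float]:
    """Fraction of originally correct answers that stay correct when perturbed.

    Returns None when no trial was correct on the original input.
    """
    kept = [t.correct_perturbed for t in trials if t.correct_original]
    return sum(kept) / len(kept) if kept else None


def recovery_rate(trials: Sequence[Trial]) -> Optional[float]:
    """Fraction of originally incorrect answers that become correct when perturbed.

    Returns None when every trial was correct on the original input.
    """
    fixed = [t.correct_perturbed for t in trials if not t.correct_original]
    return sum(fixed) / len(fixed) if fixed else None
```

Under these assumptions, a low retention rate on document perturbations would correspond to the kind of fragility the abstract reports, while the recovery rate would matter mainly when the altered retrieval results supply information the original context lacked.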
