RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems

June 1, 2025
Authors: Yixiao Zeng, Tianyu Cao, Danqing Wang, Xinran Zhao, Zimeng Qiu, Morteza Ziyadi, Tongshuang Wu, Lei Li
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single-hop and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 400 expert-level time-sensitive finance, economics, and policy documents and 48,322 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our results show that RAG systems exhibit surprising vulnerability to perturbations, with document robustness consistently being the weakest point regardless of generator size or architecture. Across all domains, RAG systems consistently show lower robustness on multi-hop queries than on single-hop queries.