RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems
June 1, 2025
Authors: Yixiao Zeng, Tianyu Cao, Danqing Wang, Xinran Zhao, Zimeng Qiu, Morteza Ziyadi, Tongshuang Wu, Lei Li
cs.AI
Abstract
Retrieval-Augmented Generation (RAG) enhances recency and factuality in
answers. However, existing evaluations rarely test how well these systems cope
with real-world noise, conflicts between internal and external retrieved
contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness
Evaluation (RARE), a unified framework and large-scale benchmark that jointly
stress-tests query and document perturbations over dynamic, time-sensitive
corpora. One of the central features of RARE is a knowledge-graph-driven
synthesis pipeline (RARE-Get) that automatically extracts single-hop and multi-hop
relations from the customized corpus and generates multi-level question sets
without manual intervention. Leveraging this pipeline, we construct a dataset
(RARE-Set) spanning 400 expert-level time-sensitive finance, economics, and
policy documents and 48,322 questions whose distribution evolves as the
underlying sources change. To quantify resilience, we formalize
retrieval-conditioned robustness metrics (RARE-Met) that capture a model's
ability to remain correct or recover when queries, documents, or real-world
retrieval results are systematically altered. Our results show that RAG systems
exhibit surprising vulnerability to perturbations, with document robustness
consistently being the weakest point regardless of generator size or
architecture. RAG systems consistently show lower robustness on multi-hop
queries than single-hop queries across all domains.
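
The abstract does not give the formal definition of the RARE-Met metrics, so the following is only a minimal sketch of the general idea of scoring a RAG system under query and document perturbations. Everything in it is an assumption made for illustration: the `Case` record, the `robustness` function (the fraction of cases answered correctly after a perturbation is applied), the substring-based `is_correct` check, the two toy perturbations, and `toy_system`, which stands in for a real retriever plus generator.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Case:
    """One benchmark item (hypothetical schema, not the RARE-Set format)."""
    query: str
    documents: Tuple[str, ...]
    answer: str


# A RAG system is modeled as a function from (query, documents) to an answer.
RagSystem = Callable[[str, Tuple[str, ...]], str]
# A perturbation maps a case to an altered copy of that case.
Perturbation = Callable[[Case], Case]


def is_correct(prediction: str, gold: str) -> bool:
    """Lenient match: the gold answer appears in the prediction (assumption)."""
    return gold.strip().lower() in prediction.strip().lower()


def robustness(system: RagSystem, cases: Sequence[Case],
               perturbation: Perturbation) -> float:
    """Fraction of cases answered correctly after the perturbation is applied."""
    if not cases:
        raise ValueError("cases must be non-empty")
    hits = 0
    for case in cases:
        perturbed = perturbation(case)
        prediction = system(perturbed.query, perturbed.documents)
        # Always score against the gold answer of the unperturbed case.
        hits += is_correct(prediction, case.answer)
    return hits / len(cases)


def shout_query(case: Case) -> Case:
    """Toy query perturbation: upper-case the query text."""
    return replace(case, query=case.query.upper())


def drop_documents(case: Case) -> Case:
    """Toy document perturbation: remove all retrieved documents."""
    return replace(case, documents=())


def toy_system(query: str, documents: Tuple[str, ...]) -> str:
    """Stand-in system that can only answer from the first retrieved document."""
    return documents[0] if documents else "unknown"


CASES = [
    Case("What is the policy rate?", ("The policy rate is 5%.",), "5%"),
    Case("Who chairs the board?", ("The board is chaired by Ada.",), "Ada"),
]

query_score = robustness(toy_system, CASES, shout_query)
document_score = robustness(toy_system, CASES, drop_documents)
print(f"query robustness:    {query_score:.2f}")
print(f"document robustness: {document_score:.2f}")
```

In this toy setup the stand-in system ignores the query wording but depends entirely on the retrieved text, so it is unaffected by the query perturbation and fails completely under the document perturbation. This is a property of the toy system only and is not evidence for the paper's findings.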