RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems

June 1, 2025
Authors: Yixiao Zeng, Tianyu Cao, Danqing Wang, Xinran Zhao, Zimeng Qiu, Morteza Ziyadi, Tongshuang Wu, Lei Li
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single-hop and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 400 expert-level time-sensitive finance, economics, and policy documents and 48,322 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our results show that RAG systems exhibit surprising vulnerability to perturbations, with document robustness consistently being the weakest point regardless of generator size or architecture. Across all domains, RAG systems consistently show lower robustness on multi-hop queries than on single-hop queries.
