RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems

Abstract

Retrieval-Augmented Generation (RAG) improves the recency and factuality of answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. A central feature of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single-hop and multi-hop relations from a customized corpus and generates multi-level question sets without manual intervention. Using this pipeline, we construct a dataset (RARE-Set) spanning 400 expert-level, time-sensitive finance, economics, and policy documents and 48,322 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our results show that RAG systems are surprisingly vulnerable to perturbations, with document robustness consistently the weakest point regardless of generator size or architecture. Across all domains, RAG systems are also consistently less robust on multi-hop queries than on single-hop queries.
