RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

Abstract

Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.
