
RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

October 15, 2025
Authors: Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.
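The agentic workflow the abstract refers to (iteratively plan, retrieve, and reason over a complex query) can be pictured with a minimal sketch. The loop below is an illustrative assumption, not the paper's system or RAGCap-Bench itself; the `llm` and `retrieve` callables are hypothetical placeholders for any model call and any retriever.

```python
# Minimal sketch of an agentic RAG loop, assuming generic llm/retrieve callables:
# the model plans a sub-query, evidence is retrieved, and the model reasons over
# the accumulated evidence until it decides it can answer.
from typing import Callable, List


def agentic_rag(
    question: str,
    llm: Callable[[str], str],             # hypothetical text-in/text-out LLM call
    retrieve: Callable[[str], List[str]],  # hypothetical retriever returning passages
    max_steps: int = 5,
) -> str:
    evidence: List[str] = []
    for _ in range(max_steps):
        # Planning step: ask the model for the next search query given evidence so far.
        sub_query = llm(
            "Question: " + question + "\n"
            "Evidence so far:\n" + "\n".join(evidence) + "\n"
            "Propose the next search query, or reply ANSWER_READY if you can answer."
        )
        if "ANSWER_READY" in sub_query:
            break
        # Retrieval step: fetch passages for the planned sub-query.
        evidence.extend(retrieve(sub_query))
    # Reasoning step: synthesize a final answer from the accumulated evidence.
    return llm(
        "Answer the question using only the evidence below.\n"
        "Question: " + question + "\nEvidence:\n" + "\n".join(evidence)
    )
```

RAGCap-Bench evaluates the intermediate steps of such a loop (e.g., planning and evidence reasoning) rather than only the final answer.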