RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems
October 15, 2025
Authors: Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li
cs.AI
Abstract
Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.