How Much Reasoning Do Retrieval-Augmented Models Add beyond LLMs? A Benchmarking Framework for Multi-Hop Inference over Hybrid Knowledge
February 10, 2026
Authors: Junhong Lin, Bing Zhang, Song Wang, Ziyan Liu, Dan Gutfreund, Julian Shun, Yada Zhu
cs.AI
Abstract
Large language models (LLMs) continue to struggle with knowledge-intensive questions that require up-to-date information and multi-hop reasoning. Augmenting LLMs with hybrid external knowledge, such as unstructured text and structured knowledge graphs, offers a promising alternative to costly continual pretraining. As such, reliable evaluation of their retrieval and reasoning capabilities becomes critical. However, many existing benchmarks increasingly overlap with LLM pretraining data, which means answers or supporting knowledge may already be encoded in model parameters, making it difficult to distinguish genuine retrieval and reasoning from parametric recall. We introduce HybridRAG-Bench, a framework for constructing benchmarks to evaluate retrieval-intensive, multi-hop reasoning over hybrid knowledge. HybridRAG-Bench automatically couples unstructured text and structured knowledge graph representations derived from recent scientific literature on arXiv, and generates knowledge-intensive question-answer pairs grounded in explicit reasoning paths. The framework supports flexible domain and time-frame selection, enabling contamination-aware and customizable evaluation as models and knowledge evolve. Experiments across three domains (artificial intelligence, governance and policy, and bioinformatics) demonstrate that HybridRAG-Bench rewards genuine retrieval and reasoning rather than parametric recall, offering a principled testbed for evaluating hybrid knowledge-augmented reasoning systems. We release our code and data at github.com/junhongmit/HybridRAG-Bench.
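The abstract notes that the framework supports flexible domain and time-frame selection for contamination-aware evaluation. As a minimal sketch (not taken from the paper's released code), the Python example below shows one way to gather recent arXiv papers in a chosen category submitted after a given cutoff date, using the public arXiv Atom API; the category, cutoff date, and result limit are illustrative assumptions only.

# Minimal sketch: selecting recent arXiv papers in a chosen domain and time frame,
# the kind of contamination-aware corpus selection the framework describes.
# The category, cutoff date, and max_results below are illustrative assumptions.
import urllib.parse
import urllib.request
from datetime import datetime, timezone

import feedparser  # pip install feedparser

ARXIV_API = "http://export.arxiv.org/api/query"

def fetch_recent_papers(category: str, cutoff: datetime, max_results: int = 50):
    """Return (title, abstract, published) tuples for papers newer than `cutoff`."""
    query = urllib.parse.urlencode({
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    })
    with urllib.request.urlopen(f"{ARXIV_API}?{query}") as resp:
        feed = feedparser.parse(resp.read())
    papers = []
    for entry in feed.entries:
        published = datetime(*entry.published_parsed[:6], tzinfo=timezone.utc)
        if published >= cutoff:  # keep only papers past the chosen contamination cutoff
            papers.append((entry.title, entry.summary, published))
    return papers

# Example: papers submitted after a hypothetical pretraining cutoff date.
recent = fetch_recent_papers("cs.AI", datetime(2025, 6, 1, tzinfo=timezone.utc))

Such a corpus of post-cutoff papers would then feed the framework's downstream steps (text and knowledge-graph coupling, and QA-pair generation), which are described in the paper and repository rather than sketched here.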