Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?

August 5, 2025
Authors: Wenxuan Shen, Mingjia Wang, Yaochen Wang, Dongping Chen, Junjie Yang, Yao Wan, Weiwei Lin
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) systems built on Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on specific parts of the document RAG pipeline and rely on synthetic data with incomplete ground-truth and evidence labels, and therefore fail to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that produces fine-grained assessments of each component within a document RAG system. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types, with streamlined dynamic-update support to address potential data contamination. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs, and 4 end-to-end document RAG frameworks demonstrate that the gap between text and visual embedding models is narrowing, highlighting the need to build stronger document retrieval models. Our findings also reveal an over-confidence dilemma in current document RAG frameworks, which tend to provide answers even without evidence support. We hope our fully open-source Double-Bench provides a rigorous foundation for future research on advanced document RAG systems, and we plan to collect up-to-date corpora and release new benchmarks annually.
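Because every Double-Bench query is grounded in human-verified evidence pages, the retrieval component can be scored directly against those page labels, independently of the generator. The sketch below illustrates one such per-component metric (evidence-page recall@k) under assumed data structures; the EXAMPLE_QUERIES records and the retrieve_pages callable are hypothetical illustrations, not the benchmark's actual API.

```python
# Minimal sketch: scoring a document retriever against labeled evidence pages.
# EXAMPLE_QUERIES and retrieve_pages are hypothetical, not Double-Bench's API.
from typing import Callable

# Each query maps to the set of evidence pages that support its answer;
# multi-hop queries have more than one labeled page.
EXAMPLE_QUERIES = [
    {"query": "What was the 2023 revenue?", "evidence_pages": {"doc1_p4"}},
    {"query": "Which two policies conflict?", "evidence_pages": {"doc2_p1", "doc2_p7"}},
]

def recall_at_k(
    retrieve_pages: Callable[[str, int], list[str]],
    queries: list[dict],
    k: int = 5,
) -> float:
    """Average fraction of labeled evidence pages found in the top-k results."""
    scores = []
    for item in queries:
        retrieved = set(retrieve_pages(item["query"], k))  # top-k page IDs
        gold = item["evidence_pages"]
        scores.append(len(retrieved & gold) / len(gold))
    return sum(scores) / len(scores)

# Usage: plug in any embedding-based retriever's search function.
# print(recall_at_k(my_retriever.search, EXAMPLE_QUERIES, k=5))
```

Measuring recall over labeled pages rather than answer correctness is what lets a benchmark like this compare text and visual embedding models in isolation from the downstream MLLM.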