

Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?

August 5, 2025
作者: Wenxuan Shen, Mingjia Wang, Yaochen Wang, Dongping Chen, Junjie Yang, Yao Wan, Weiwei Lin
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) systems built on Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on specific parts of the document RAG pipeline and use synthetic data with incomplete ground-truth and evidence labels, and therefore fail to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that produces fine-grained assessments of each component within a document RAG system. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types, with streamlined dynamic-update support to address potential data contamination. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs, and 4 end-to-end document RAG frameworks demonstrate that the gap between text and visual embedding models is narrowing, highlighting the need for stronger document retrieval models. Our findings also reveal an over-confidence dilemma in current document RAG frameworks, which tend to provide answers even without supporting evidence. We hope our fully open-source Double-Bench provides a rigorous foundation for future research on advanced document RAG systems. We plan to refresh the corpus and release new benchmarks on an annual basis.
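To illustrate what a component-level retrieval metric of the kind the abstract describes might look like, here is a minimal sketch of recall@k computed against ground-truth evidence pages. All names (`Query`, `recall_at_k`, the field layout) are illustrative assumptions, not the paper's actual API or metric definitions.

```python
# Hedged sketch: component-wise retriever evaluation against labeled
# evidence pages. Names and data layout are hypothetical, not from the paper.
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Query:
    text: str
    evidence_pages: Set[int]     # ground-truth page ids for this query
    retrieved_pages: List[int]   # ranked pages returned by an embedding model


def recall_at_k(queries: List[Query], k: int) -> float:
    """Fraction of queries whose top-k retrieved pages include at least
    one ground-truth evidence page -- a common proxy for isolating
    retriever quality inside a document RAG pipeline."""
    hits = sum(
        1 for q in queries
        if q.evidence_pages & set(q.retrieved_pages[:k])
    )
    return hits / len(queries)


queries = [
    Query("What is the 2023 revenue?", {12}, [3, 12, 7]),
    Query("Who signed the contract?", {5}, [9, 1, 4]),
]
print(recall_at_k(queries, k=3))  # 0.5: only the first query's evidence is covered
```

Scoring the retriever in isolation like this, rather than only grading end-to-end answers, is what makes per-component diagnosis possible; a multi-hop variant would additionally require that all evidence pages for a query be covered.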