

M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding

November 7, 2024
Authors: Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, Mohit Bansal
cs.AI

Abstract

Document visual question answering (DocVQA) pipelines that answer questions from documents have broad applications. Existing methods focus on handling single-page documents with multi-modal language models (MLMs), or rely on text-based retrieval-augmented generation (RAG) that uses text extraction tools such as optical character recognition (OCR). However, these methods are difficult to apply in real-world scenarios: (a) questions often require information spread across different pages or documents, and MLMs cannot handle many long documents; (b) documents often carry important information in visual elements such as figures, but text extraction tools ignore them. We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts (closed-domain and open-domain), question hops (single-hop and multi-hop), and evidence modalities (text, chart, figure, etc.). M3DocRAG finds relevant documents and answers questions using a multi-modal retriever and an MLM, so it can efficiently handle single or many documents while preserving visual information. Since previous DocVQA datasets ask questions in the context of a specific document, we also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages. Across three benchmarks (M3DocVQA/MMLongBench-Doc/MP-DocVQA), empirical results show that M3DocRAG with ColPali and Qwen2-VL 7B outperforms many strong baselines, including achieving state-of-the-art performance on MP-DocVQA. We provide comprehensive analyses of different indexing, MLM, and retrieval-model choices. Lastly, we qualitatively show that M3DocRAG can successfully handle various scenarios, such as when relevant information exists across multiple pages and when answer evidence exists only in images.
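
For readers who want a concrete picture of the retrieve-then-read pipeline the abstract describes, below is a minimal sketch. The MaxSim late-interaction scoring follows the ColBERT/ColPali formulation; `embed_query`, `embed_pages`, and `answer_with_mlm` are hypothetical stand-ins for a ColPali-style multi-modal retriever and an MLM such as Qwen2-VL 7B, not the authors' released code.

```python
# A minimal sketch of a retrieve-then-read multi-modal DocVQA pipeline in the
# spirit of M3DocRAG. Assumptions: pages are rendered as images; embed_pages /
# embed_query return L2-normalized token/patch embedding matrices from a
# ColPali-style retriever; answer_with_mlm wraps an MLM such as Qwen2-VL 7B.
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction relevance: for each query token, take the similarity of
    its best-matching page patch, then sum over query tokens.

    query_emb: (n_query_tokens, dim), page_emb: (n_patches, dim).
    """
    sim = query_emb @ page_emb.T            # (n_query_tokens, n_patches)
    return float(sim.max(axis=1).sum())     # MaxSim over patches, summed over tokens

def retrieve_and_answer(question, page_images,
                        embed_query, embed_pages, answer_with_mlm,
                        top_k: int = 4) -> str:
    # 1) Embed every page image (done once, offline, in the open-domain setting).
    page_embs = embed_pages(page_images)    # list of (n_patches, dim) arrays
    # 2) Embed the question and score all pages with MaxSim.
    q_emb = embed_query(question)           # (n_query_tokens, dim)
    scores = [maxsim_score(q_emb, p) for p in page_embs]
    # 3) Keep the top-k pages as raw images, so figures and charts survive
    #    (unlike OCR-based text RAG).
    top_idx = np.argsort(scores)[::-1][:top_k]
    top_pages = [page_images[i] for i in top_idx]
    # 4) Feed the retrieved page images plus the question to the MLM.
    return answer_with_mlm(question, top_pages)
```

Scoring pages as images rather than as extracted text is what lets the pipeline keep evidence that lives only in figures or charts. For an open-domain corpus with tens of thousands of pages, the per-page embeddings would typically be stored in an approximate-nearest-neighbor index (e.g., FAISS) rather than scanned exhaustively, which is the kind of indexing trade-off the paper's analyses cover.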