M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
November 7, 2024
Authors: Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, Mohit Bansal
cs.AI
Abstract
Document visual question answering (DocVQA) pipelines that answer questions
from documents have broad applications. Existing methods focus on handling
single-page documents with multi-modal language models (MLMs), or rely on
text-based retrieval-augmented generation (RAG) that uses text extraction tools
such as optical character recognition (OCR). However, there are difficulties in
applying these methods in real-world scenarios: (a) questions often require
information spread across different pages or documents, yet MLMs cannot process
many long documents at once; (b) documents often have important information in visual
elements such as figures, but text extraction tools ignore them. We introduce
M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various
document contexts (closed-domain and open-domain), question hops (single-hop
and multi-hop), and evidence modalities (text, chart, figure, etc.). M3DocRAG
finds relevant documents and answers questions using a multi-modal retriever
and an MLM, so it can efficiently handle one document or many documents while
preserving visual information. Since previous DocVQA datasets ask questions in
the context of a specific document, we also present M3DocVQA, a new benchmark
for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
On three benchmarks (M3DocVQA/MMLongBench-Doc/MP-DocVQA), empirical results
show that M3DocRAG with ColPali and Qwen2-VL 7B outperforms many strong
baselines, including achieving state-of-the-art performance on MP-DocVQA. We
provide comprehensive analyses of different indexing strategies, MLMs, and
retrieval models. Lastly, we qualitatively show that M3DocRAG can successfully
handle various scenarios, such as when relevant information exists across
multiple pages and when answer evidence only exists in images.
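
To make the retrieval stage concrete, the sketch below illustrates the
late-interaction (MaxSim) scoring that ColPali-style multi-modal retrievers use
to rank document pages against a question. This is a minimal illustration of
the technique under stated assumptions, not the authors' implementation: random
tensors stand in for real page/query embeddings, and the tensor shapes and
top-k value are illustrative choices.

    import torch

    def maxsim_scores(query_emb: torch.Tensor, page_embs: torch.Tensor) -> torch.Tensor:
        """Late-interaction (MaxSim) scoring, as in ColPali-style retrievers.

        query_emb: (num_query_tokens, dim) multi-vector query embedding.
        page_embs: (num_pages, num_page_tokens, dim) multi-vector page embeddings.
        Returns one relevance score per page.
        """
        # Similarity of every query token to every token of every page.
        sim = torch.einsum("qd,pnd->pqn", query_emb, page_embs)
        # Each query token keeps its best-matching page token;
        # scores are summed over query tokens.
        return sim.max(dim=-1).values.sum(dim=-1)

    # Toy corpus: 100 pages, 256 visual tokens per page, 128-dim embeddings
    # (real embeddings would come from a multi-modal retriever such as ColPali).
    page_embs = torch.randn(100, 256, 128)
    query_emb = torch.randn(16, 128)  # 16 query tokens

    scores = maxsim_scores(query_emb, page_embs)  # shape: (100,)
    top_pages = scores.topk(k=4).indices          # page indices to retrieve
    # The top-k page images would then be passed, together with the question,
    # to a multi-modal LM (e.g., Qwen2-VL 7B) to generate the answer.

Because pages are scored independently of which document they come from, the
same routine covers both the closed-domain setting (pages of one document) and
the open-domain setting (40,000+ pages across 3,000+ PDFs); in the open-domain
case, an approximate index over the page embeddings can replace the exhaustive
scan shown here.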