M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding

Abstract

Document visual question answering (DocVQA) pipelines, which answer questions from documents, have broad applications. Existing methods either focus on handling single-page documents with multi-modal language models (MLMs) or rely on text-based retrieval-augmented generation (RAG) that uses text extraction tools such as optical character recognition (OCR). However, these methods are difficult to apply in real-world scenarios: (a) questions often require information spread across different pages or documents, and MLMs cannot handle many long documents; (b) documents often carry important information in visual elements such as figures, but text extraction tools ignore them. We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts (closed-domain and open-domain), question hops (single-hop and multi-hop), and evidence modalities (text, chart, figure, etc.). M3DocRAG finds relevant documents and answers questions using a multi-modal retriever and an MLM, so it can efficiently handle single or many documents while preserving visual information. Since previous DocVQA datasets ask questions in the context of a specific document, we also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages. On three benchmarks (M3DocVQA, MMLongBench-Doc, and MP-DocVQA), empirical results show that M3DocRAG with ColPali and Qwen2-VL 7B outperforms many strong baselines and achieves state-of-the-art performance on MP-DocVQA. We provide comprehensive analyses of different indexing strategies, MLMs, and retrieval models. Lastly, we qualitatively show that M3DocRAG can successfully handle various scenarios, such as when relevant information exists across multiple pages and when answer evidence exists only in images.
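
The retrieve-then-answer flow described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how pages are scored, so this example assumes a late-interaction (MaxSim) score over per-token and per-patch embeddings, using toy 2-D vectors. All names here (`maxsim_score`, `retrieve_top_k`, `answer_question`, the `mlm` callable, and the page index layout) are hypothetical and are not taken from the paper's implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
PageKey = Tuple[str, int]  # (document id, page number); hypothetical layout


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Assumed late-interaction score: for each query token take its
    best-matching page patch, then sum those maxima over query tokens."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def retrieve_top_k(
    query_tokens: List[Vector],
    page_index: Dict[PageKey, List[Vector]],
    k: int,
) -> List[PageKey]:
    """Score every page of every document and return the k best page keys."""
    scored = [
        (maxsim_score(query_tokens, patches), key)
        for key, patches in page_index.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [key for _, key in scored[:k]]


def answer_question(
    question: str,
    query_tokens: List[Vector],
    page_index: Dict[PageKey, List[Vector]],
    k: int,
    mlm: Callable[[str, List[PageKey]], str],
) -> str:
    """Retrieve the top-k pages, then hand them to a multi-modal LM.
    `mlm` is a stand-in callable; a real system would pass page images."""
    pages = retrieve_top_k(query_tokens, page_index, k)
    return mlm(question, pages)


# Toy index: three pages from two documents, two patch embeddings each.
page_index: Dict[PageKey, List[Vector]] = {
    ("doc_a", 1): [[1.0, 0.0], [0.0, 0.1]],
    ("doc_a", 2): [[0.0, 1.0], [0.2, 0.0]],
    ("doc_b", 1): [[0.7, 0.7], [0.1, 0.1]],
}
query_tokens: List[Vector] = [[1.0, 0.0], [0.0, 1.0]]

top_pages = retrieve_top_k(query_tokens, page_index, k=2)
```

Because pages from all documents share one index, the same call covers both the closed-domain case (an index built from one document) and the open-domain case (an index built from many). Scoring page images rather than extracted text is what lets evidence in figures and charts take part in retrieval.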
