UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG

Abstract

Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models (LLMs) and agents to real-world knowledge bases. Current evaluations, however, are fragmented: they treat text or images in isolation, or they rely on simplified multimodal setups that fail to capture document-centric multimodal use cases. In this paper, we introduce UniDoc-Bench, the first large-scale, realistic benchmark for MM-RAG, built from 70k real-world PDF pages across eight domains. Our pipeline extracts and links evidence from text, tables, and figures, then generates 1,600 multimodal QA pairs spanning factual retrieval, comparison, summarization, and logical reasoning queries. To ensure reliability, 20% of the QA pairs are validated by multiple annotators, with expert adjudication. Under a unified protocol with standardized candidate pools, prompts, and evaluation metrics, UniDoc-Bench supports apples-to-apples comparison across four paradigms: (1) text-only, (2) image-only, (3) multimodal text-image fusion, and (4) multimodal joint retrieval. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal retrieval and retrieval based on joint multimodal embeddings, indicating that neither text nor images alone is sufficient and that current multimodal embeddings remain inadequate. Beyond benchmarking, our analysis reveals when and how visual context complements textual evidence, uncovers systematic failure modes, and offers actionable guidance for developing more robust MM-RAG pipelines.
