Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding

Abstract

Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies that cross page boundaries. We present a novel multimodal document chunking approach that uses Large Multimodal Models (LMMs) to process PDF documents in batches while maintaining semantic coherence and structural integrity. The method processes a document in configurable page batches and preserves context from one batch to the next, which lets it accurately handle tables that span multiple pages, embedded visual elements, and procedural content. We evaluate the approach on a curated dataset of PDF documents with manually crafted queries and demonstrate improvements in both chunk quality and downstream RAG performance. The vision-guided approach achieves higher accuracy than traditional vanilla RAG systems, and qualitative analysis shows that it better preserves document structure and semantic coherence.
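The abstract describes the batching scheme only at a high level, so the following is a minimal sketch of what processing "configurable page batches with cross-batch context preservation" could look like, not the authors' implementation. The names `page_batches`, `chunk_document`, and `chunk_batch` are hypothetical; `chunk_batch` stands in for the LMM call, whose actual prompt and interface the abstract does not specify, and pages are represented as plain strings for simplicity.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def page_batches(num_pages: int, batch_size: int) -> Iterator[range]:
    """Yield consecutive page-index ranges of at most batch_size pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, num_pages, batch_size):
        yield range(start, min(start + batch_size, num_pages))


def chunk_document(
    pages: Sequence[str],
    batch_size: int,
    chunk_batch: Callable[[List[str], str], Tuple[List[str], str]],
) -> List[str]:
    """Chunk a document batch by batch, carrying context across batches.

    chunk_batch is a hypothetical stand-in for the LMM call (an assumption,
    not the paper's API). It receives the pages of one batch plus the context
    carried over from the previous batch, and returns
    (chunks, context_for_next_batch).
    """
    chunks: List[str] = []
    context = ""  # nothing to carry into the first batch
    for batch in page_batches(len(pages), batch_size):
        new_chunks, context = chunk_batch([pages[i] for i in batch], context)
        chunks.extend(new_chunks)
    return chunks
```

In this sketch the carried context is an opaque string, so a real `chunk_batch` could use it to pass along, for example, the header of a table that continues onto the next batch. What the method actually carries between batches is not stated in the abstract.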

