

Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding

June 19, 2025
Authors: Vishesh Tripathi, Tanmay Odapally, Indraneel Das, Uday Allu, Biddwan Ahmed
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a novel multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches while maintaining semantic coherence and structural integrity. Our method processes documents in configurable page batches with cross-batch context preservation, enabling accurate handling of tables spanning multiple pages, embedded visual elements, and procedural content. We evaluate our approach on a curated dataset of PDF documents with manually crafted queries, demonstrating improvements in chunk quality and downstream RAG performance. Our vision-guided approach achieves better accuracy compared to traditional vanilla RAG systems, with qualitative analysis showing superior preservation of document structure and semantic coherence.
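The abstract's core mechanism — processing pages in configurable batches while carrying context across batch boundaries — can be sketched as follows. This is a minimal illustration of the batching idea only, not the authors' implementation; all names and parameters here (`page_batches`, `batch_size`, `context_pages`) are hypothetical, and the paper's actual pipeline sends the batched page images to a Large Multimodal Model (LMM) for chunking.

```python
def page_batches(num_pages, batch_size=4, context_pages=1):
    """Group page indices into fixed-size batches, carrying the last
    few pages of the previous batch along as context.

    Hypothetical sketch: the trailing pages of each batch are re-sent
    with the next batch so that tables or sections spanning a batch
    boundary are not split blindly.
    """
    batches = []
    start = 0
    while start < num_pages:
        end = min(start + batch_size, num_pages)
        # Pages from the previous batch that provide cross-batch context.
        context = list(range(max(0, start - context_pages), start))
        batches.append((context, list(range(start, end))))
        start = end
    return batches


# Example: a 10-page PDF in batches of 4 with 1 context page.
for context, batch in page_batches(10, batch_size=4, context_pages=1):
    print(context, batch)
```

The overlap means a table that begins on the last page of one batch is seen again, with its continuation, in the next batch, which is what allows the model to emit a single coherent chunk for multi-page structures.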