Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding
June 19, 2025
Authors: Vishesh Tripathi, Tanmay Odapally, Indraneel Das, Uday Allu, Biddwan Ahmed
cs.AI
Abstract
Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a novel multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches while maintaining semantic coherence and structural integrity. Our method processes documents in configurable page batches with cross-batch context preservation, enabling accurate handling of tables spanning multiple pages, embedded visual elements, and procedural content. We evaluate our approach on a curated dataset of PDF documents with manually crafted queries, demonstrating improvements in chunk quality and downstream RAG performance. Our vision-guided approach achieves better accuracy compared to traditional vanilla RAG systems, with qualitative analysis showing superior preservation of document structure and semantic coherence.
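The abstract describes the pipeline only at a high level, so the following is a minimal sketch of batched, vision-guided chunking with cross-batch context preservation. Every name here (call_lmm, Chunk, BATCH_SIZE, the pre-rendered page images) is an illustrative assumption, not the authors' implementation or any real LMM API.

from dataclasses import dataclass

BATCH_SIZE = 4  # configurable page batch size (assumed value)

@dataclass
class Chunk:
    text: str                    # semantically coherent chunk content
    page_range: tuple[int, int]  # pages the chunk spans

def call_lmm(page_images: list, context: str) -> tuple[list["Chunk"], str]:
    """Placeholder for a Large Multimodal Model call: takes a batch of
    rendered PDF pages plus the context carried over from the previous
    batch, and returns the new chunks and an updated context summary."""
    raise NotImplementedError  # swap in a real LMM client here

def chunk_document(page_images: list) -> list[Chunk]:
    chunks: list[Chunk] = []
    context = ""  # cross-batch context keeps multi-page tables and procedures intact
    for start in range(0, len(page_images), BATCH_SIZE):
        batch = page_images[start:start + BATCH_SIZE]
        batch_chunks, context = call_lmm(batch, context)
        chunks.extend(batch_chunks)
    return chunks

In a downstream RAG pipeline, the returned chunks would be embedded and indexed as usual; the difference from vanilla text chunking lies only in how chunk boundaries are decided.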