Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding

Abstract

Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies that cross page boundaries. We present a multimodal document chunking approach that uses Large Multimodal Models (LMMs) to process PDF documents in batches while maintaining semantic coherence and structural integrity. The method processes documents in configurable page batches and preserves context across batches, which lets it handle tables spanning multiple pages, embedded visual elements, and procedural content accurately. We evaluate the approach on a curated dataset of PDF documents with manually crafted queries and show improvements in chunk quality and downstream RAG performance. The vision-guided approach achieves higher accuracy than vanilla RAG systems, and qualitative analysis shows better preservation of document structure and semantic coherence.
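
The batching idea in the abstract (configurable page batches, with context carried from one batch to the next) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `page_batches`, `chunk_document`, and `toy_chunker` are hypothetical, pages are represented as plain strings rather than rendered PDF pages, and the LMM call is replaced by a caller-supplied function that takes a batch plus carried context and returns chunks plus the context for the next batch.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical signature standing in for an LMM call:
# (pages in this batch, context carried from the previous batch)
#   -> (chunks produced for this batch, context to carry forward)
BatchChunker = Callable[[List[str], str], Tuple[List[str], str]]


def page_batches(num_pages: int, batch_size: int) -> Iterator[range]:
    """Yield consecutive page-index ranges of at most batch_size pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, num_pages, batch_size):
        yield range(start, min(start + batch_size, num_pages))


def chunk_document(
    pages: Sequence[str], batch_size: int, chunk_batch: BatchChunker
) -> List[str]:
    """Chunk a document batch by batch, threading context between batches."""
    carry = ""  # no context exists before the first batch
    all_chunks: List[str] = []
    for index_range in page_batches(len(pages), batch_size):
        batch = [pages[i] for i in index_range]
        chunks, carry = chunk_batch(batch, carry)
        all_chunks.extend(chunks)
    return all_chunks


def toy_chunker(batch: List[str], carry: str) -> Tuple[List[str], str]:
    """Toy stand-in for the model: one chunk per page, prefixed with the
    carried context; the last page of the batch becomes the next context."""
    chunks = [f"{carry}|{page}" if carry else page for page in batch]
    return chunks, batch[-1]


if __name__ == "__main__":
    pages = ["p1", "p2", "p3", "p4", "p5"]
    for chunk in chunk_document(pages, batch_size=2, chunk_batch=toy_chunker):
        print(chunk)
```

In a real pipeline the carried context would more plausibly be something like an open table header or the current section heading, so that a table continuing on the next batch's first page can be attached to the right chunk. What to carry is a design choice that the abstract does not specify.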

