Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations

Abstract

Harnessing the full potential of visually rich documents requires retrieval systems that understand not just text but also intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a severe storage bottleneck. Current optimization strategies, such as embedding merging, pruning, or abstract tokens, fail to resolve it without degrading performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact, structure-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
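
The representation described above can be illustrated with a minimal sketch. The abstract does not specify how sub-image embeddings are fused with the page-level vector or how queries are scored, so everything here is an assumption for illustration: the fusion is taken to be a weighted sum followed by L2 normalization, scoring is taken to be ColBERT-style MaxSim late interaction, and the names `fuse_with_global`, `maxsim_score`, and `alpha` are hypothetical rather than part of ColParse.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_with_global(region_vecs: List[Vector], page_vec: Vector,
                     alpha: float = 0.5) -> List[Vector]:
    """Blend each layout-region embedding with the global page vector.

    ASSUMPTION: a weighted sum with weight `alpha` on the region vector,
    followed by L2 normalization. The source only says the two are "fused".
    """
    fused = []
    for region in region_vecs:
        mixed = [alpha * r + (1.0 - alpha) * p for r, p in zip(region, page_vec)]
        fused.append(l2_normalize(mixed))
    return fused


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late-interaction score (ASSUMED): for every query vector, take its
    best dot product over the document vectors, then sum those maxima."""
    return sum(
        max(sum(q_i * d_i for q_i, d_i in zip(q, d)) for d in doc_vecs)
        for q in query_vecs
    )


# Toy usage: one page with two parsed regions and a two-token query.
page_vec = [1.0, 0.0]
region_vecs = [[0.0, 1.0], [1.0, 0.0]]
doc_vecs = fuse_with_global(region_vecs, page_vec, alpha=0.5)
score = maxsim_score([[1.0, 0.0], [0.0, 1.0]], doc_vecs)
```

Under these assumptions, a page is stored as one vector per parsed region instead of one vector per image patch, which is where a storage saving of the kind the abstract reports would come from.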
