Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations
March 2, 2026
作者: Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Jiahao Huo, Yu Huang, James Kwok, Xuming Hu
cs.AI
Abstract
Harnessing the full potential of visually rich documents requires retrieval systems that understand not just text but also intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
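To make the compact representation concrete, the sketch below illustrates the general idea under stated assumptions: a handful of layout-informed sub-image embeddings are stacked with one page-level vector, and queries are scored with ColBERT-style MaxSim late interaction. This is not the authors' implementation; the function names, dimensions, and vector counts are hypothetical, and the embeddings are random placeholders standing in for a document-parsing model's output.

```python
import numpy as np

def build_page_representation(sub_image_embs: np.ndarray,
                              global_emb: np.ndarray) -> np.ndarray:
    """Stack K layout-informed sub-image embeddings with one global
    page-level vector into a compact (K+1, d) multi-vector set.
    (Hypothetical helper; not ColParse's actual code.)"""
    page = np.vstack([sub_image_embs, global_emb[None, :]])
    # L2-normalize so dot products below act as cosine similarities.
    return page / np.linalg.norm(page, axis=1, keepdims=True)

def maxsim_score(query_embs: np.ndarray, page_embs: np.ndarray) -> float:
    """ColBERT-style late interaction: each query token takes its best
    match among the page vectors; matches are summed over query tokens."""
    sim = query_embs @ page_embs.T          # (Q, K+1) similarity matrix
    return float(sim.max(axis=1).sum())

# Toy example: 4 sub-image vectors + 1 global vector per page, versus
# the hundreds of patch vectors a grid-based multi-vector model stores.
rng = np.random.default_rng(0)
d = 128
query = rng.standard_normal((8, d))
query /= np.linalg.norm(query, axis=1, keepdims=True)
page = build_page_representation(rng.standard_normal((4, d)),
                                 rng.standard_normal(d))
print(page.shape)                 # (5, 128)
print(maxsim_score(query, page))
```

Storing 5 vectors per page instead of a full patch grid is what drives the >95% storage reduction claimed in the abstract, while the sub-image vectors retain the layout cues that pruning or abstract-token approaches discard.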