Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations
March 2, 2026
Authors: Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Jiahao Huo, Yu Huang, James Kwok, Xuming Hu
cs.AI
Abstract
Harnessing the full potential of visually rich documents requires retrieval systems that understand not just text but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
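The fusion the abstract describes can be sketched in a few lines: a handful of sub-image embeddings produced by a parser are stacked with one global page vector to form a compact multi-vector document representation, which is then scored against a query with standard late-interaction (ColBERT-style MaxSim) matching. This is a minimal illustration of the idea, not the authors' implementation; the function names, dimensions, and the choice of MaxSim scoring are assumptions for exposition.

```python
import numpy as np

def colparse_representation(subimage_embs: np.ndarray, page_emb: np.ndarray) -> np.ndarray:
    """Fuse layout-informed sub-image embeddings with a global page-level
    vector into one compact multi-vector set (hypothetical sketch).

    subimage_embs: (k, d) embeddings of parsed layout regions (k is small).
    page_emb:      (d,)   embedding of the whole page.
    Returns:       (k+1, d) L2-normalized document vectors.
    """
    doc = np.vstack([page_emb[None, :], subimage_embs])
    # Normalize so dot products below are cosine similarities.
    return doc / np.linalg.norm(doc, axis=1, keepdims=True)

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """Late-interaction scoring: each query vector takes its maximum
    similarity over the document vectors; per-query maxima are summed."""
    sims = query_embs @ doc_embs.T          # (q, k+1) similarity matrix
    return float(sims.max(axis=1).sum())    # MaxSim over doc vectors
```

Because k is a small, parser-determined number of regions rather than hundreds of grid patches per page, the stored set stays compact, which is the source of the storage savings the abstract reports.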