Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations

Abstract

Harnessing the full potential of visually rich documents requires retrieval systems that understand not only text but also intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a critical storage bottleneck. Current optimization strategies, such as embedding merging, pruning, or abstract tokens, fail to resolve it without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact, structure-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
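
As a rough illustration of the idea summarized above, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how sub-image embeddings are fused with the page-level vector or how queries are scored, so everything here is an assumption: `fuse_with_global` uses a simple weighted blend with a hypothetical weight `alpha`, `maxsim_score` uses the late-interaction MaxSim scoring common in ColBERT-style multi-vector retrieval, and the vector counts passed to `storage_reduction` are illustrative numbers, not results from the paper.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_with_global(region_vecs: List[Vector], global_vec: Vector,
                     alpha: float = 0.5) -> List[Vector]:
    """Hypothetical fusion: blend each layout-region vector with the
    page-level vector, then renormalize. `alpha` is an assumed weight."""
    fused = []
    for r in region_vecs:
        mixed = [(1 - alpha) * a + alpha * b for a, b in zip(r, global_vec)]
        fused.append(l2_normalize(mixed))
    return fused


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late-interaction (MaxSim) score: for each query vector, take its best
    dot product over all document vectors, then sum over query vectors."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def storage_reduction(n_patch_vectors: int, n_region_vectors: int) -> float:
    """Fraction of per-page vectors saved when a grid of patch vectors
    is replaced by a small set of region vectors."""
    return 1.0 - n_region_vectors / n_patch_vectors


# Toy example: two region vectors and one page-level vector in 2-D.
regions = [[1.0, 0.0], [0.0, 1.0]]
page = [1.0, 0.0]
fused = fuse_with_global(regions, page, alpha=0.5)
score = maxsim_score([[1.0, 0.0]], fused)

# Illustrative counts only: 1024 grid patches versus 8 parsed regions.
saving = storage_reduction(1024, 8)
```

With these illustrative counts, keeping 8 vectors instead of 1024 per page saves a little over 99% of the stored vectors, which shows how a reduction of the order reported in the abstract could arise; the actual counts used by ColParse are not given here.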

