
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

September 22, 2025
Authors: Zilin Xiao, Qi Ma, Mengting Gu, Chun-cheng Jason Chen, Xintao Chen, Vicente Ordonez, Vijai Mohan
cs.AI

Abstract

Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, potentially limiting the expressiveness for fine-grained information, or produce too many vectors that are prohibitively expensive for multi-vector retrieval. In this work, we introduce MetaEmbed, a new framework for multimodal retrieval that rethinks how multimodal embeddings are constructed and interacted with at scale. During training, a fixed number of learnable Meta Tokens are appended to the input sequence. At test-time, their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings. Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors. As a result, we enable test-time scaling in multimodal retrieval, where users can balance retrieval quality against efficiency demands by selecting the number of tokens used for indexing and retrieval interactions. Extensive evaluations on the Massive Multimodal Embedding Benchmark (MMEB) and the Visual Document Retrieval Benchmark (ViDoRe) confirm that MetaEmbed achieves state-of-the-art retrieval performance while scaling robustly to models with 32B parameters.
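To make the mechanism concrete, below is a minimal sketch of how flexible late interaction and the nested training objective might look, assuming ColBERT-style MaxSim scoring over the first k Meta Token embeddings and an in-batch InfoNCE loss averaged over several prefix lengths. Function names, prefix sizes, and the exact loss weighting are illustrative assumptions, not the paper's verbatim formulation.

```python
import torch
import torch.nn.functional as F

def maxsim_score(q_vecs: torch.Tensor, c_vecs: torch.Tensor, k: int) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance between one query and one candidate,
    using only the first k Meta Token embeddings on each side.

    q_vecs, c_vecs: (num_meta_tokens, dim) last-layer contextualized vectors.
    """
    q = F.normalize(q_vecs[:k], dim=-1)
    c = F.normalize(c_vecs[:k], dim=-1)
    sim = q @ c.T                        # (k, k) cosine similarities
    return sim.max(dim=1).values.sum()   # best candidate match per query vector

def matryoshka_contrastive_loss(q_batch: torch.Tensor,
                                c_batch: torch.Tensor,
                                prefix_sizes=(1, 2, 4, 8, 16),
                                tau: float = 0.05) -> torch.Tensor:
    """Hypothetical nested training objective: average an in-batch InfoNCE loss
    over several Meta Token prefix lengths, so that every prefix remains a
    usable embedding on its own.

    q_batch, c_batch: (batch, num_meta_tokens, dim); positives aligned by index.
    """
    losses = []
    for k in prefix_sizes:
        q = F.normalize(q_batch[:, :k], dim=-1)            # (B, k, d)
        c = F.normalize(c_batch[:, :k], dim=-1)            # (B, k, d)
        # All-pairs token similarities, then MaxSim over candidate tokens
        # and a sum over query tokens -> (B, B) score matrix.
        sim = torch.einsum("qkd,cjd->qckj", q, c)
        scores = sim.max(dim=-1).values.sum(dim=-1) / tau
        labels = torch.arange(scores.size(0), device=scores.device)
        losses.append(F.cross_entropy(scores, labels))
    return torch.stack(losses).mean()
```

Under this reading, the test-time knob is simply k: an index built with k=1 degenerates to single-vector retrieval, while larger prefixes spend more storage and compute on finer-grained late interaction.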