MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

Abstract

Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, which can limit expressiveness for fine-grained information, or produce so many vectors that multi-vector retrieval becomes prohibitively expensive. In this work, we introduce MetaEmbed, a new framework for multimodal retrieval that rethinks how multimodal embeddings are constructed and interacted with at scale. During training, a fixed number of learnable Meta Tokens are appended to the input sequence. At test time, their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings. Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors. As a result, we enable test-time scaling in multimodal retrieval, where users can balance retrieval quality against efficiency demands by selecting the number of tokens used for indexing and retrieval interactions. Extensive evaluations on the Massive Multimodal Embedding Benchmark (MMEB) and the Visual Document Retrieval Benchmark (ViDoRe) confirm that MetaEmbed achieves state-of-the-art retrieval performance while scaling robustly to models with 32B parameters.
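The test-time trade-off described above can be illustrated with a small scoring sketch. The abstract does not spell out the scoring function, so this example assumes a ColBERT-style MaxSim sum as the late-interaction score, and it assumes that "selecting the number of tokens" means keeping a leading prefix of the Meta Token vectors. The function names `l2_normalize` and `late_interaction_score` are hypothetical, and random vectors stand in for real model outputs.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def late_interaction_score(query_vecs, cand_vecs, n_query, n_cand):
    """Assumed MaxSim-style score over a prefix of each multi-vector embedding.

    Only the first n_query query vectors and the first n_cand candidate
    vectors are used (hypothetical prefix selection, Matryoshka-style).
    """
    q = query_vecs[:n_query]          # shape (n_query, dim)
    c = cand_vecs[:n_cand]            # shape (n_cand, dim)
    sim = q @ c.T                     # pairwise similarities, (n_query, n_cand)
    # For each query vector keep its best-matching candidate vector, then sum.
    return float(sim.max(axis=1).sum())


# Stand-in embeddings: 4 query vectors, 16 vectors per candidate (assumed sizes).
rng = np.random.default_rng(0)
dim = 8
query = l2_normalize(rng.normal(size=(4, dim)))
candidates = [l2_normalize(rng.normal(size=(16, dim))) for _ in range(3)]

# Smaller budgets are cheaper to index and score; larger ones use more vectors.
for n_q, n_c in [(1, 1), (2, 4), (4, 16)]:
    scores = [late_interaction_score(query, c, n_q, n_c) for c in candidates]
    print(f"budget ({n_q}, {n_c}): best candidate = {int(np.argmax(scores))}")
```

With a budget of (1, 1) the score reduces to a single dot product, which matches single-vector retrieval in cost; larger budgets add more pairwise comparisons per query-candidate pair.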
