Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

Abstract

The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we first mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short caption retrieval, long caption retrieval, and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.
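
The second-stage idea of filtering false negatives and then sampling several hard negatives per instance can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the filtering rule or the loss, so the code below assumes cosine similarity, a hypothetical rule that treats any in-batch candidate scoring more than `margin` above the true positive as a suspected false negative, and an InfoNCE-style loss over the positive plus the `k` hardest remaining negatives. The function names and the parameters `k`, `margin`, and `temperature` are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_loss(queries, candidates, k=2, margin=0.1, temperature=0.05):
    """Illustrative in-batch loss; candidates[i] is the positive for queries[i].

    Assumed (not from the source): the margin-based false-negative rule,
    the top-k selection, and the InfoNCE-style objective.
    Returns the mean loss and the negative indices chosen for each query.
    """
    n = len(queries)
    total = 0.0
    picked = []
    for i in range(n):
        sims = [cosine(queries[i], c) for c in candidates]
        pos = sims[i]
        # Drop the positive itself and suspected false negatives, i.e.
        # candidates that score clearly higher than the labeled positive.
        valid = [j for j in range(n) if j != i and sims[j] <= pos + margin]
        # Keep the k most similar remaining candidates as hard negatives.
        hard = sorted(valid, key=lambda j: sims[j], reverse=True)[:k]
        picked.append(hard)
        # Cross-entropy with the positive at index 0, computed stably.
        logits = [pos / temperature] + [sims[j] / temperature for j in hard]
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[0]
    return total / n, picked
```

In a real training loop the embeddings would come from the MLLM and the loss would be computed on tensors with gradients; the plain-list version here only shows the selection logic, where raising `k` exposes each query to more difficult negatives and `margin` controls how aggressively likely false negatives are excluded.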
