LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning

Abstract

Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss produce similarity distributions for positive and negative pairs that overlap heavily, which makes it difficult to distinguish hard negative pairs. To address this issue, we propose a simple yet effective framework that dynamically improves the embedding model's representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically, LLaVE-2B surpasses the previous SOTA 7B models, while LLaVE-7B achieves a further performance improvement of 6.2 points. Although LLaVE is trained only on image-text data, it generalizes to text-video retrieval tasks in a zero-shot manner and achieves strong performance, demonstrating its potential for transfer to other embedding tasks.
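To make the idea of weighting negatives by difficulty concrete, here is a minimal sketch in plain Python. The abstract does not give the weighting formula, so everything specific here is an assumption for illustration: the exponential weight w = exp(alpha * s), the names info_nce and hardness_weighted_info_nce, and the parameters alpha and temperature are hypothetical and should not be read as the LLaVE implementation.

```python
import math


def info_nce(sim_row, pos_index, temperature=0.05):
    """Standard InfoNCE loss for a single query.

    sim_row:   similarities between the query and every candidate in the batch.
    pos_index: position of the matching (positive) candidate in sim_row.
    """
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[pos_index]


def hardness_weighted_info_nce(sim_row, pos_index, temperature=0.05, alpha=2.0):
    """InfoNCE in which each negative is scaled by w = exp(alpha * similarity).

    ASSUMPTION: this weighting is illustrative only; it is not taken from the
    abstract. Negatives that look more like the query (harder negatives) get a
    larger weight in the denominator. alpha = 0 recovers standard InfoNCE.
    In an autodiff framework the weights would be treated as constants.
    """
    logits = [s / temperature for s in sim_row]
    # Log-weights: 0 for the positive, alpha * s for every negative.
    log_w = [0.0 if j == pos_index else alpha * s for j, s in enumerate(sim_row)]
    weighted = [l + lw for l, lw in zip(logits, log_w)]
    m = max(weighted)
    log_denom = m + math.log(sum(math.exp(x - m) for x in weighted))
    return log_denom - logits[pos_index]


if __name__ == "__main__":
    # One positive (index 0), one hard negative (0.75), one easy negative (0.1).
    sims = [0.8, 0.75, 0.1]
    print("InfoNCE:          ", info_nce(sims, 0))
    print("Hardness-weighted:", hardness_weighted_info_nce(sims, 0))
```

In this toy setup the hard negative at similarity 0.75 receives a much larger weight than the easy one at 0.1, so it dominates the penalty. Setting alpha to 0 gives the unweighted loss back, which makes the effect of the weighting easy to isolate.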
