

LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning

March 4, 2025
Authors: Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
cs.AI

Abstract

Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss exhibit a high degree of overlap in similarity distribution between positive and negative pairs, making it challenging to distinguish hard negative pairs effectively. To deal with this issue, we propose a simple yet effective framework that dynamically improves the embedding model's representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically, LLaVE-2B surpasses the previous SOTA 7B models, while LLaVE-7B achieves a further performance improvement of 6.2 points. Although LLaVE is trained on image-text data, it can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.
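The abstract does not spell out the exact weighting function, but the core idea it describes, reweighting the standard InfoNCE objective so that negatives that are harder to distinguish (i.e., more similar to the query) contribute more to the loss, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the exponential weighting form, the `beta` and `temperature` values, and the function name are assumptions.

```python
import torch
import torch.nn.functional as F


def hardness_weighted_infonce(query_emb, target_emb, temperature=0.05, beta=2.0):
    """Contrastive loss with hardness-weighted in-batch negatives (illustrative sketch).

    query_emb, target_emb: (N, D) embeddings from the LMM; row i of each forms
    a positive pair, and all (i, j != i) combinations serve as in-batch negatives.
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)

    sim = q @ t.T                      # (N, N) cosine similarities
    logits = sim / temperature
    n = sim.size(0)
    pos_mask = torch.eye(n, dtype=torch.bool, device=sim.device)

    # Assumed weighting: negatives that are more similar to the query (harder to
    # distinguish) receive exponentially larger weights. The weights are detached
    # so the hardness estimate itself is not optimized.
    with torch.no_grad():
        neg_weights = torch.exp(beta * sim).masked_fill(pos_mask, 0.0)
        neg_weights = neg_weights / neg_weights.sum(dim=-1, keepdim=True).clamp_min(1e-12)
        neg_weights = neg_weights * (n - 1)   # keep scale comparable to uniform weighting

    exp_logits = torch.exp(logits)
    pos_term = exp_logits.diagonal()                     # positive-pair term
    neg_term = (neg_weights * exp_logits).sum(dim=-1)    # weighted negative terms

    return -torch.log(pos_term / (pos_term + neg_term)).mean()


# Example usage with random embeddings standing in for LMM outputs:
queries = torch.randn(32, 768)
targets = torch.randn(32, 768)
print(hardness_weighted_infonce(queries, targets).item())
```

With beta = 0 the sketch reduces to the standard InfoNCE loss over in-batch negatives; increasing beta sharpens the focus on hard negative pairs, which is the behavior the paper's framework aims for. Whether LLaVE uses this particular weighting or detaches the weights is not stated in the abstract.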
