LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning
March 4, 2025
Authors: Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
cs.AI
Abstract
Universal multimodal embedding models play a critical role in tasks such as
interleaved image-text retrieval, multimodal RAG, and multimodal clustering.
However, our empirical results indicate that existing LMM-based embedding
models trained with the standard InfoNCE loss exhibit a high degree of overlap
between the similarity distributions of positive and negative pairs, making it
challenging to distinguish hard negative pairs effectively. To address this
issue, we propose a simple yet effective framework that dynamically improves
the embedding model's representation learning for negative pairs based on their
discriminative difficulty. Within this framework, we train a series of models,
named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks
and 36 datasets. Experimental results show that LLaVE establishes stronger
baselines that achieve state-of-the-art (SOTA) performance while demonstrating
strong scalability and efficiency. Specifically, LLaVE-2B surpasses the
previous SOTA 7B models, while LLaVE-7B achieves a further performance
improvement of 6.2 points. Although LLaVE is trained on image-text data, it can
generalize to text-video retrieval tasks in a zero-shot manner and achieve
strong performance, demonstrating its remarkable potential for transfer to
other embedding tasks.
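The abstract describes the core idea (dynamically emphasizing negative pairs according to how hard they are to distinguish) but not the exact weighting formula. Below is a minimal PyTorch sketch of one common way to realize hardness-weighted contrastive learning on top of in-batch InfoNCE; the function name `hardness_weighted_infonce` and the hyperparameters `temperature` and `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def hardness_weighted_infonce(query_emb: torch.Tensor,
                              target_emb: torch.Tensor,
                              temperature: float = 0.05,
                              alpha: float = 1.0) -> torch.Tensor:
    """Hardness-weighted InfoNCE over in-batch negatives.

    (query_emb[i], target_emb[i]) is assumed to be a positive pair;
    all other in-batch combinations act as negatives. `temperature`
    and `alpha` are illustrative hyperparameters, not values from
    the paper.
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)

    sim = q @ t.T                       # (B, B) cosine similarities
    logits = sim / temperature

    # Hardness weights: negatives that are more similar to the query
    # (i.e., harder to distinguish) contribute more to the denominator,
    # pushing the model harder to separate them. Positives keep a
    # weight of 1 so the standard InfoNCE numerator is unchanged.
    with torch.no_grad():
        weights = torch.exp(alpha * sim)
        weights.fill_diagonal_(1.0)

    weighted_exp = weights * torch.exp(logits)
    log_prob_pos = logits.diagonal() - torch.log(weighted_exp.sum(dim=-1))
    return -log_prob_pos.mean()
```

In a training loop, `loss = hardness_weighted_infonce(query_embeddings, target_embeddings)` would replace the standard InfoNCE call; with `alpha = 0` all weights collapse to 1 and the loss reduces to ordinary InfoNCE, which makes the hardness term easy to ablate.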