LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning

Abstract

Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss exhibit a high degree of overlap between the similarity distributions of positive and negative pairs, which makes it difficult to distinguish hard negative pairs effectively. To address this issue, we propose a simple yet effective framework that dynamically improves the embedding model's representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically, LLaVE-2B surpasses the previous SOTA 7B models, while LLaVE-7B achieves a further performance improvement of 6.2 points. Although LLaVE is trained only on image-text data, it generalizes to text-video retrieval tasks in a zero-shot manner and achieves strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.
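To make the idea of weighting negative pairs by difficulty concrete, the following is a minimal sketch of one way a hardness-weighted InfoNCE loss could look. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the function name `hardness_weighted_info_nce`, the exponential weight `exp(alpha * s_ij)` on each negative, and the default `temperature` are hypothetical and may differ from what the LLaVE authors actually use. With `alpha = 0` the sketch reduces to the standard InfoNCE loss.

```python
import math


def hardness_weighted_info_nce(sims, temperature=0.05, alpha=0.0):
    """Mean contrastive loss over a batch (illustrative sketch, not the paper's code).

    sims[i][j] is the similarity between query i and target j;
    the positive pair for query i sits on the diagonal (j == i).
    Each negative is up-weighted by exp(alpha * sims[i][j]), an assumed
    weighting scheme: more similar (harder) negatives count more.
    In an autodiff framework the weight would typically be treated as a
    constant (no gradient); plain Python has no gradients, so this is moot here.
    """
    n = len(sims)
    total = 0.0
    for i in range(n):
        pos_logit = sims[i][i] / temperature
        # Log-space terms of the denominator: the positive has weight 1,
        # each negative contributes log(w_ij) + s_ij / temperature.
        logits = [pos_logit]
        for j in range(n):
            if j != i:
                logits.append(alpha * sims[i][j] + sims[i][j] / temperature)
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - pos_logit
    return total / n


# Usage: a batch of two queries where the first has a hard negative (0.8).
example_sims = [[0.9, 0.8], [0.1, 0.9]]
plain_loss = hardness_weighted_info_nce(example_sims, alpha=0.0)
weighted_loss = hardness_weighted_info_nce(example_sims, alpha=2.0)
```

Under this assumed weighting, raising `alpha` increases the penalty contributed by negatives whose similarity to the query is high, which is the kind of pair the abstract reports standard InfoNCE struggles to separate.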

