Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

Abstract

The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we first mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvements across all tasks, exhibiting superior discriminative and compositional capabilities.
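The second-stage idea of filtering likely false negatives and then keeping several hard negatives per instance can be illustrated with a minimal sketch. The abstract does not give the filtering rule, the number of negatives, or the loss, so everything concrete here is an assumption made for illustration: the similarity threshold `sim(query, positive) + margin`, the values of `margin`, `k`, and `temperature`, the InfoNCE-style loss, and the function names `select_hard_negatives` and `info_nce`. It should not be read as the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_hard_negatives(query, positive, candidates, k=2, margin=0.1):
    """Return indices of up to k hard negatives for one query.

    Assumed rule (not stated in the abstract): a candidate whose similarity
    to the query exceeds sim(query, positive) + margin is treated as a
    likely false negative and dropped. Among the survivors, the k most
    similar candidates are kept as hard negatives.
    """
    threshold = cosine(query, positive) + margin
    scored = [(cosine(query, c), i) for i, c in enumerate(candidates)]
    kept = [(s, i) for s, i in scored if s <= threshold]
    kept.sort(key=lambda t: t[0], reverse=True)
    return [i for _, i in kept[:k]]


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query (assumed loss, for illustration)."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


# Toy batch: candidate 0 is a near-duplicate of the query (likely false negative).
query = [1.0, 0.0]
positive = [0.8, 0.6]
candidates = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]

hard = select_hard_negatives(query, positive, candidates, k=2, margin=0.1)
print(hard)  # -> [1, 2]
print(info_nce(query, positive, [candidates[i] for i in hard]))
```

In this toy batch, candidate 0 is more similar to the query than the labeled positive, so the assumed threshold removes it before the two most similar remaining candidates are chosen. Training on these harder negatives yields a larger loss than training on an easy negative such as candidate 3, which is the sense in which the model is pushed to focus on challenging samples.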
