Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
April 24, 2025
Authors: Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng
cs.AI
Abstract
The Contrastive Language-Image Pre-training (CLIP) framework has become a
widely used approach for multimodal representation learning, particularly in
image-text retrieval and clustering. However, its efficacy is constrained by
three key limitations: (1) text token truncation, (2) isolated image-text
encoding, and (3) deficient compositionality due to bag-of-words behavior.
While recent Multimodal Large Language Models (MLLMs) have demonstrated
significant advances in generalized vision-language understanding, their
potential for learning transferable multimodal representations remains
underexplored. In this work, we present UniME (Universal Multimodal Embedding),
a novel two-stage framework that leverages MLLMs to learn discriminative
representations for diverse downstream tasks. In the first stage, we perform
textual discriminative knowledge distillation from a powerful LLM-based teacher
model to enhance the embedding capability of the MLLM's language component. In
the second stage, we introduce hard negative enhanced instruction tuning to
further advance discriminative representation learning. Specifically, we
initially mitigate false negative contamination and then sample multiple hard
negatives per instance within each batch, forcing the model to focus on
challenging samples. This approach not only improves discriminative power but
also enhances instruction-following ability in downstream tasks. We conduct
extensive experiments on the MMEB benchmark and multiple retrieval tasks,
including short- and long-caption retrieval and compositional retrieval. Results
demonstrate that UniME achieves consistent performance improvements across all
tasks, exhibiting superior discriminative and compositional capabilities.
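To make the two training stages described above more concrete, the following is a minimal, hypothetical sketch of the first stage (textual discriminative knowledge distillation): the MLLM's language component is treated as a student whose in-batch text-to-text similarity distribution is matched to that of a frozen LLM-based teacher embedder. The KL-based formulation, the temperature, and all function and tensor names are assumptions for illustration; the abstract does not specify the exact distillation objective.

```python
# Hypothetical sketch of stage 1: align the student's in-batch text similarity
# distribution with a frozen LLM-based teacher's. Names and the KL objective
# are illustrative assumptions, not the paper's released implementation.
import torch
import torch.nn.functional as F

def distill_text_embeddings(student_emb, teacher_emb, temperature=0.05):
    # student_emb: (B, D_s), teacher_emb: (B, D_t) embeddings of the same batch
    # of sentences; normalize so dot products are cosine similarities.
    student_emb = F.normalize(student_emb, dim=-1)
    teacher_emb = F.normalize(teacher_emb, dim=-1)

    s_logits = (student_emb @ student_emb.t()) / temperature   # (B, B)
    t_logits = (teacher_emb @ teacher_emb.t()) / temperature   # (B, B)

    # Mask self-similarity so each distribution is over the other sentences.
    mask = torch.eye(s_logits.size(0), dtype=torch.bool, device=s_logits.device)
    s_logits = s_logits.masked_fill(mask, -1e4)
    t_logits = t_logits.masked_fill(mask, -1e4)

    # Match the student's distribution to the (detached) teacher's.
    s_logprob = F.log_softmax(s_logits, dim=-1)
    t_prob = F.softmax(t_logits, dim=-1).detach()
    return F.kl_div(s_logprob, t_prob, reduction="batchmean")
```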
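Similarly, the second stage (hard negative enhanced instruction tuning) can be sketched as an InfoNCE-style loss that first discards likely false negatives and then keeps only the hardest remaining in-batch candidates. Again, the margin `beta`, the number of hard negatives `k`, and all names below are illustrative assumptions rather than the paper's actual implementation.

```python
# Hypothetical sketch of stage 2: false-negative filtering followed by top-k
# hard negative selection within each batch, then an InfoNCE-style loss.
import torch
import torch.nn.functional as F

def hard_negative_infonce(query_emb, cand_emb, temperature=0.05, beta=0.1, k=8):
    # query_emb, cand_emb: (B, D) L2-normalized embeddings; cand_emb[i] is the
    # positive for query_emb[i]; other in-batch candidates act as negatives.
    sim = query_emb @ cand_emb.t()                 # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                  # (B, 1) positive similarities

    # 1) Mitigate false-negative contamination: drop in-batch "negatives" whose
    #    similarity comes within `beta` of the positive's similarity.
    not_self = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    likely_false_neg = sim >= (pos - beta)
    valid_neg = not_self & ~likely_false_neg

    # 2) Keep only the k hardest remaining negatives per query.
    neg_sim = sim.masked_fill(~valid_neg, -1e4)    # large negative = excluded
    hard_negs, _ = neg_sim.topk(min(k, sim.size(1) - 1), dim=1)

    # 3) InfoNCE over the positive plus the selected hard negatives.
    logits = torch.cat([pos, hard_negs], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

Restricting the denominator to the positive plus a few hard negatives concentrates the gradient on challenging samples, which is the effect the abstract attributes to this stage.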