UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
October 15, 2025
Authors: Tiancheng Gu, Kaicheng Yang, Kaichen Zhang, Xiang An, Ziyong Feng, Yueyi Zhang, Weidong Cai, Jiankang Deng, Lidong Bing
cs.AI
Abstract
Universal multimodal embedding models are foundational to various tasks.
Existing approaches typically employ in-batch negative mining by measuring the
similarity of query-candidate pairs. However, these methods often struggle to
capture subtle semantic differences among candidates and lack diversity in
negative samples. Moreover, the embeddings exhibit limited discriminative
ability in distinguishing false negatives from hard negatives. In this paper, we leverage
the advanced understanding capabilities of MLLMs to enhance representation
learning and present a novel Universal Multimodal Embedding (UniME-V2) model.
Our approach first constructs a potential hard negative set through global
retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes
MLLMs to assess the semantic alignment of query-candidate pairs and generate
soft semantic matching scores. These scores serve as a foundation for hard
negative mining, mitigating the impact of false negatives and enabling the
identification of diverse, high-quality hard negatives. Furthermore, the
semantic matching scores are used as soft labels to mitigate the rigid
one-to-one mapping constraint. By aligning the similarity matrix with the soft
semantic matching score matrix, the model learns semantic distinctions among
candidates, significantly enhancing its discriminative capacity. To further
improve performance, we propose UniME-V2-Reranker, a reranking model trained on
our mined hard negatives through a joint pairwise and listwise optimization
approach. We conduct comprehensive experiments on the MMEB benchmark and
multiple retrieval tasks, demonstrating that our method achieves
state-of-the-art performance on average across all tasks.
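The two core training ideas in the abstract can be sketched in a few lines: use the judge's semantic matching scores both to filter likely false negatives out of the globally retrieved candidate set, and as soft labels that the model's similarity distribution is aligned against. This is an illustrative NumPy sketch, not the paper's implementation; the threshold, temperature, and function names are assumptions.

```python
import numpy as np

def softmax(x, temp=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(x, dtype=float) / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mine_hard_negatives(judge_scores, false_neg_thresh=0.9, k=4):
    """Hypothetical mining rule: candidates the judge rates above the
    threshold are treated as false negatives and dropped; the highest-
    scoring survivors are kept as diverse, high-quality hard negatives."""
    order = np.argsort(-np.asarray(judge_scores))      # most aligned first
    keep = [int(i) for i in order if judge_scores[i] < false_neg_thresh]
    return keep[:k]

def soft_alignment_loss(sim_row, judge_row, temp=0.05):
    """KL divergence between the judge's soft label distribution and the
    model's similarity distribution, relaxing the one-hot target."""
    p = softmax(judge_row, temp)    # soft labels from the MLLM judge
    q = softmax(sim_row, temp)      # model's query-candidate similarities
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```

When the embedding model's similarity row matches the judge's score row, the loss is zero; any mismatch in the relative ordering or magnitude of candidates is penalized, which is how the model learns the semantic distinctions among candidates rather than a rigid one-to-one mapping.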
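The reranker is trained with a joint pairwise and listwise objective on the mined hard negatives. The abstract does not give the exact losses, so the sketch below uses a standard log-sigmoid pairwise term and a softmax cross-entropy listwise term; the mixing weight `lam` and the convention that the positive sits at index 0 are assumptions.

```python
import numpy as np

def pairwise_loss(pos_score, neg_scores):
    """Log-sigmoid pairwise ranking: push the positive's score above
    each mined hard negative's score."""
    diffs = pos_score - np.asarray(neg_scores, dtype=float)
    return float(np.mean(np.log1p(np.exp(-diffs))))

def listwise_loss(pos_score, neg_scores):
    """Softmax cross-entropy over the full candidate list, with the
    positive candidate placed at index 0."""
    logits = np.concatenate([[pos_score], np.asarray(neg_scores, dtype=float)])
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])

def joint_loss(pos_score, neg_scores, lam=0.5):
    """Weighted combination of the two ranking objectives."""
    return (lam * pairwise_loss(pos_score, neg_scores)
            + (1.0 - lam) * listwise_loss(pos_score, neg_scores))
```

Both terms decrease as the positive's score rises relative to the hard negatives, so the joint objective rewards rankers that separate the positive from exactly the confusable candidates that global retrieval and the judge surfaced.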