UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

October 15, 2025
作者: Tiancheng Gu, Kaicheng Yang, Kaichen Zhang, Xiang An, Ziyong Feng, Yueyi Zhang, Weidong Cai, Jiankang Deng, Lidong Bing
cs.AI

Abstract

Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.
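The abstract describes two mechanisms without giving their exact form: mining hard negatives using the judge's soft semantic matching scores (filtering likely false negatives), and aligning the embedding model's similarity matrix with the soft score matrix. The sketch below illustrates one plausible reading of both ideas; the KL-divergence alignment objective, the false-negative threshold, and the temperature values are assumptions for illustration, not the paper's actual formulation.

```python
import math

def softmax(xs, tau=1.0):
    """Temperature-scaled softmax over a list of scores."""
    zs = [x / tau for x in xs]
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def mine_hard_negatives(judge_scores, positive_idx, fn_threshold=0.8, k=2):
    """Select up to k hard negatives for one query.

    judge_scores : soft semantic-match score per candidate from the MLLM
                   judge. Candidates above `fn_threshold` are treated as
                   likely false negatives and excluded; among the rest, the
                   highest-scoring (most confusable) candidates are kept.
    """
    pool = [(s, i) for i, s in enumerate(judge_scores)
            if i != positive_idx and s < fn_threshold]
    pool.sort(reverse=True)  # hardest (highest remaining score) first
    return [i for _, i in pool[:k]]

def soft_alignment_loss(sim_row, judge_row, tau=0.05, eps=1e-9):
    """KL(judge || model) for one query: align the model's similarity
    distribution over candidates with the judge's soft-label distribution,
    relaxing the rigid one-to-one (one-hot) mapping."""
    p = softmax(judge_row, tau)   # soft labels from the MLLM judge
    q = softmax(sim_row, tau)     # model's similarity distribution
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))
```

When the model's similarities already match the judge's scores, the loss is zero; any divergence in how the model ranks candidates relative to the judge is penalized, which is how the soft labels teach semantic distinctions among candidates rather than a single hard positive.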