
VIRTUE: Visual-Interactive Text-Image Universal Embedder

October 1, 2025
Authors: Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, Shusuke Takahashi, Yuki Mitsufuji
cs.AI

Abstract
Multimodal representation learning models have demonstrated strong performance across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities that let users specify regions of interest (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples that aims to retrieve the text caption by jointly considering the entity with a specific object and the image scene. VIRTUE consistently achieves state-of-the-art performance with significant improvements across 36 universal MMEB tasks (3.1%-8.5%) and five visual-interactive SCaR tasks (15.2%-20.3%).
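The abstract's core idea — combining an entity-level embedding, localized by a user's visual prompt, with a global scene embedding for caption retrieval — can be illustrated with a minimal sketch. This is not VIRTUE's actual architecture (which uses a segmentation model inside a VLM); here a bounding-box prompt is turned into a binary mask over a hypothetical patch-feature grid, region and scene features are mask-pooled and fused, and captions are ranked by cosine similarity. All function names, shapes, and the fusion weight `alpha` are illustrative assumptions.

```python
import numpy as np

def box_to_mask(h, w, box):
    # Convert a user-drawn bounding box (x0, y0, x1, y1) on the feature grid
    # into a binary mask; stands in for a segmentation model's prompt output.
    mask = np.zeros((h, w), dtype=np.float32)
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = 1.0
    return mask

def visual_prompt_embed(feat_grid, mask, alpha=0.5):
    # feat_grid: (H, W, D) patch features from a vision encoder (hypothetical).
    # mask: (H, W) region indicator derived from a point/box/mask prompt.
    h, w, d = feat_grid.shape
    flat = feat_grid.reshape(-1, d)
    global_vec = flat.mean(axis=0)                       # scene-level pooling
    weights = mask.reshape(-1, 1)
    entity_vec = (flat * weights).sum(axis=0) / max(weights.sum(), 1e-6)
    fused = alpha * entity_vec + (1.0 - alpha) * global_vec  # entity + scene
    return fused / np.linalg.norm(fused)                 # unit-norm embedding

# Toy retrieval: rank candidate caption embeddings by cosine similarity.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 16, 32)).astype(np.float32)
mask = box_to_mask(16, 16, (2, 2, 8, 8))
query = visual_prompt_embed(feats, mask)
captions = rng.normal(size=(5, 32)).astype(np.float32)
captions /= np.linalg.norm(captions, axis=1, keepdims=True)
scores = captions @ query
best = int(scores.argmax())  # index of the best-matching caption
```

Because the query is a convex combination of region and scene features, retrieval is sensitive both to the prompted entity and to its surrounding context, which is the behavior the SCaR benchmark is designed to test.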