VIRTUE: Visual-Interactive Text-Image Universal Embedder
October 1, 2025
Authors: Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, Shusuke Takahashi, Yuki Mitsufuji
cs.AI
Abstract
Multimodal representation learning models have demonstrated success across complex tasks, and the integration of vision-language models (VLMs) has further endowed embedding models with instruction-following capabilities. However, existing embedding models lack the visual-interactive capabilities that let users specify regions of interest (e.g., a point, bounding box, or mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new applications with localized grounding of user intent, which remain unexplored, but also enable the models to learn entity-level information within images, complementing their global representations on conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of a segmentation model and a vision-language model to the realm of representation learning. In VIRTUE, the segmentation model processes visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate VIRTUE's visual-interaction ability, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples, which aims to retrieve the text caption by jointly considering a specific object entity and the image scene. VIRTUE consistently achieves state-of-the-art performance, with significant improvements across 36 universal MMEB tasks (3.1%-8.5%) and five visual-interactive SCaR tasks (15.2%-20.3%).
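
To make the described interaction concrete, the following is a minimal sketch (not the authors' code) of how a visual-interactive embedder could be queried for SCaR-style caption retrieval. The function names (embed_image, embed_text), the embedding dimensionality DIM, the box coordinates, and the random placeholder "encoders" are all hypothetical stand-ins for VIRTUE's actual components; only the overall flow (region-conditioned image embedding, cosine-similarity ranking of candidate captions) follows the abstract.

# Minimal sketch of visual-interactive caption retrieval.
# All names and encoders below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # hypothetical embedding dimensionality

def embed_image(image: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Stand-in for the image side: the visual prompt (here a bounding
    box; a point or mask would work analogously) selects the entity
    whose features would be fused with the global scene features."""
    # Placeholder: a real model would run a segmentation model on `box`
    # and fuse entity-level and scene-level representations.
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def embed_text(caption: str) -> np.ndarray:
    """Stand-in for the text side."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

# SCaR-style retrieval: rank candidate captions by cosine similarity
# to the region-conditioned image embedding (vectors are unit-norm,
# so a dot product is the cosine similarity).
image = np.zeros((480, 640, 3), dtype=np.uint8)      # dummy image
query = embed_image(image, box=(120, 60, 300, 240))  # user-drawn box
captions = [
    "a dog catching a frisbee in a park",
    "a frisbee lying on the grass",
    "a crowd watching a dog show",
]
sims = np.stack([embed_text(c) for c in captions]) @ query
print(captions[int(np.argmax(sims))])  # top-1 retrieved caption

With real encoders in place of the random placeholders, the top-ranked caption would be the one that matches both the prompted object and its surrounding scene, which is the joint criterion the SCaR benchmark evaluates.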