VIRTUE: Visual-Interactive Text-Image Universal Embedder

Abstract

Multimodal representation learning models have performed well across complex tasks, and the integration of vision-language models (VLMs) has further given embedding models instruction-following capabilities. However, existing embedding models lack the visual-interactive capabilities that let users specify regions of interest (e.g., a point, bounding box, or mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new, so far unexplored applications with localized grounding of user intent, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples, in which the goal is to retrieve the text caption by jointly considering a specific object and the image scene. VIRTUE consistently achieves state-of-the-art performance, with significant improvements across 36 universal MMEB tasks (3.1%-8.5%) and five visual-interactive SCaR tasks (15.2%-20.3%).
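To make the caption-retrieval setting concrete, the following is a minimal sketch of how an embedding-based retrieval task of this kind could be scored. It is not the VIRTUE implementation: the embeddings are hand-written toy vectors, the helper names (`cosine`, `rank_captions`, `top1_accuracy`) are hypothetical, and the use of cosine similarity with a top-1 metric is an assumption, since the abstract does not state the scoring details.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_captions(query_emb, caption_embs):
    """Return candidate caption indices, most similar first."""
    scores = [cosine(query_emb, c) for c in caption_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def top1_accuracy(queries, candidate_sets, gold):
    """Fraction of queries whose top-ranked caption is the gold one."""
    hits = sum(
        rank_captions(q, cands)[0] == g
        for q, cands, g in zip(queries, candidate_sets, gold)
    )
    return hits / len(queries)

# Toy data (assumed): one query embedding standing in for an
# (image, region prompt) pair, and three candidate caption embeddings.
query = [1.0, 0.0]
captions = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]

ranking = rank_captions(query, captions)
accuracy = top1_accuracy([query], [captions], gold=[1])
```

In a real system the query vector would come from the embedder applied to the image together with the user's visual prompt, and the candidate vectors from the same embedder applied to the captions.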
