

Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

February 26, 2026
作者: Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
cs.AI

Abstract

Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
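The core idea of fusing textual and visual support features into a per-image classifier can be illustrated with a minimal sketch. The snippet below is a toy approximation, not the paper's method: it assumes CLIP-style class text embeddings and mask-pooled visual features from the support set, and replaces the learned, per-query fusion with a fixed scalar gate `alpha` (all shapes and names here are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, H, W = 16, 3, 8, 8  # feature dim, num classes, spatial size (toy values)

text_emb = rng.normal(size=(C, D))     # per-class text-prompt embeddings (assumed CLIP-style)
support_emb = rng.normal(size=(C, D))  # mask-pooled visual features from pixel-annotated support images
query_feats = rng.normal(size=(H, W, D))  # dense features of the query image

def l2norm(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hand-crafted stand-in for the learned fusion: a fixed gate between
# text and visual prototypes. The paper's adapter instead learns this
# combination per query at test time.
alpha = 0.5
classifier = l2norm(alpha * l2norm(text_emb) + (1 - alpha) * l2norm(support_emb))

# Cosine similarity of every pixel feature against each fused class prototype,
# followed by a per-pixel argmax to produce the segmentation map.
logits = l2norm(query_feats) @ classifier.T  # shape (H, W, C)
pred = logits.argmax(axis=-1)                # shape (H, W), per-pixel class ids
print(pred.shape)  # (8, 8)
```

Because the prototypes are unit-normalized, `logits` are bounded cosine similarities, and growing the support set only updates `support_emb`, which is what lets the support set expand continually without retraining the backbone.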