

Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

February 26, 2026
作者: Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
cs.AI

Abstract

Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
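The core idea of fusing textual and visual support features into a per-image classifier can be illustrated with a minimal sketch. The snippet below is a toy approximation, not the paper's method: it assumes CLIP-style class text embeddings and mask-pooled visual features from the support set, and replaces the learned, per-query fusion with a fixed scalar gate `alpha` (all shapes and names here are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, H, W = 16, 3, 8, 8  # feature dim, num classes, spatial size (toy values)

text_emb = rng.normal(size=(C, D))     # per-class text-prompt embeddings (assumed CLIP-style)
support_emb = rng.normal(size=(C, D))  # mask-pooled visual features from pixel-annotated support images
query_feats = rng.normal(size=(H, W, D))  # dense features of the query image

def l2norm(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hand-crafted stand-in for the learned fusion: a fixed gate between
# text and visual prototypes. The paper's adapter instead learns this
# combination per query at test time.
alpha = 0.5
classifier = l2norm(alpha * l2norm(text_emb) + (1 - alpha) * l2norm(support_emb))

# Cosine similarity of every pixel feature against each fused class prototype,
# followed by a per-pixel argmax to produce the segmentation map.
logits = l2norm(query_feats) @ classifier.T  # shape (H, W, C)
pred = logits.argmax(axis=-1)                # shape (H, W), per-pixel class ids
print(pred.shape)  # (8, 8)
```

Because the prototypes are unit-normalized, `logits` are bounded cosine similarities, and growing the support set only updates `support_emb`, which is what lets the support set expand continually without retraining the backbone.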