Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Abstract

Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS still lags behind fully supervised approaches because of two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this setting, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods that rely on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between the two modalities. The method supports continually expanding support sets and applies to fine-grained tasks such as personalized segmentation. Experiments show that the method significantly narrows the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
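The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of a retrieval-augmented, per-image classifier, not the authors' method. It assumes that patch features, text embeddings, and support features live in a shared embedding space, and it replaces them with synthetic vectors. The function names (`retrieve_support`, `fit_adapter`, `segment`), the cosine top-k retrieval, and the choice of a linear softmax classifier initialized from text embeddings are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_support(query_feats, support_feats, k=5):
    """Indices of support features among the k nearest (cosine) to any query feature."""
    sims = l2_normalize(query_feats) @ l2_normalize(support_feats).T  # (Nq, Ns)
    k = min(k, support_feats.shape[0])
    topk = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    return np.unique(topk)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_adapter(text_embeds, feats, labels, steps=50, lr=0.02, scale=10.0):
    """Fit a linear classifier for one image (an assumed stand-in for the adapter).

    Weights start at the text embeddings and are refined by gradient descent on
    the retrieved visual support features plus the text embeddings themselves,
    so both modalities shape the final decision boundary.
    """
    num_classes = text_embeds.shape[0]
    X = l2_normalize(np.concatenate([feats, text_embeds]))
    y = np.concatenate([labels, np.arange(num_classes)])
    Y = np.eye(num_classes)[y]                      # one-hot targets
    W = l2_normalize(text_embeds).copy()            # text-based initialization
    for _ in range(steps):
        P = softmax(scale * X @ W.T)
        grad = scale * (P - Y).T @ X / X.shape[0]   # gradient of mean cross-entropy
        W -= lr * grad
    return W


def segment(query_feats, W):
    """Assign each query patch feature to the highest-scoring class."""
    return np.argmax(l2_normalize(query_feats) @ W.T, axis=1)


def demo(seed=0):
    """Run the sketch on synthetic features and return patch-level accuracy."""
    rng = np.random.default_rng(seed)
    num_classes, dim = 3, 16
    protos = rng.normal(size=(num_classes, dim))          # hidden class centers
    text = protos + 0.5 * rng.normal(size=protos.shape)   # noisy "text" embeddings
    sup_labels = np.repeat(np.arange(num_classes), 20)
    support = protos[sup_labels] + 0.3 * rng.normal(size=(sup_labels.size, dim))
    q_labels = rng.integers(0, num_classes, size=64)
    query = protos[q_labels] + 0.3 * rng.normal(size=(64, dim))

    idx = retrieve_support(query, support, k=5)           # per-image retrieval
    W = fit_adapter(text, support[idx], sup_labels[idx])  # per-image classifier
    return float(np.mean(segment(query, W) == q_labels))
```

Calling `demo()` runs retrieval, fitting, and prediction end to end on the synthetic data. In a real pipeline the random vectors would be replaced by VLM patch and text embeddings, and the support features would come from the pixel-annotated support images.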