Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Abstract

Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS still lags behind fully supervised approaches, mainly because of two challenges: (1) the coarse image-level supervision used to train VLMs and (2) the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods that rely on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
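As a rough illustration of the retrieve-then-adapt idea described in the abstract, the following NumPy sketch retrieves the support features nearest to a query image's patch features, fits a small linear classifier on them at test time, and labels the query patches. It is not the authors' implementation. Every name here (`retrieve`, `fit_per_image_classifier`, `segment`) is hypothetical, and the specific fusion choice, initialising the classifier from text embeddings and regularising toward them, is an assumption made for this sketch; the abstract does not specify how the fusion is learned.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve(query_feats, support_feats, k):
    """Return indices of support features among the k nearest to any query patch."""
    sims = query_feats @ support_feats.T        # (Nq, Ns) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]     # (Nq, k) nearest support indices
    return np.unique(topk)                      # flattened, de-duplicated


def fit_per_image_classifier(text_emb, feats, labels,
                             steps=200, lr=0.5, scale=10.0, reg=0.1):
    """Fit a linear classifier for one query image (assumed fusion scheme).

    The weights start at the text embeddings (textual prior) and are tuned with
    softmax cross-entropy on the retrieved, pixel-labelled support features
    (visual evidence). The `reg` term keeps the weights close to the text prior.
    """
    W = text_emb.copy()                         # (C, D)
    onehot = np.eye(text_emb.shape[0])[labels]  # (N, C)
    for _ in range(steps):
        logits = scale * feats @ W.T
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = scale * (probs - onehot).T @ feats / len(feats)
        grad += reg * (W - text_emb)
        W -= lr * grad
    return W


def segment(query_feats, text_emb, support_feats, support_labels, k=5):
    """Predict one class index per query patch feature."""
    q = l2_normalize(query_feats)
    s = l2_normalize(support_feats)
    t = l2_normalize(text_emb)
    idx = retrieve(q, s, k)
    W = fit_per_image_classifier(t, s[idx], support_labels[idx])
    return np.argmax(q @ W.T, axis=1)
```

In this sketch, `query_feats` and `support_feats` stand in for dense patch features from a VLM image encoder, `text_emb` for one text embedding per class prompt, and `support_labels` for integer class indices taken from the pixel annotations. Because the classifier is refit for each query, adding images to the support set requires no retraining of the backbone.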
