INSID3: Training-Free In-Context Segmentation with DINOv3

Abstract

In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given a single annotated visual example. Existing work either (i) fine-tunes vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but adds architectural complexity and fixes the segmentation granularity. We revisit ICS from a minimalist perspective and ask: can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that, given an in-context example, segments concepts at varying granularities using only frozen DINOv3 features. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU while using 3x fewer parameters and requiring no mask or category-level supervision. Code is available at https://github.com/visinf/INSID3 .
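To make the idea of segmenting from frozen dense features more concrete, the following is a minimal sketch of generic prototype matching. It is not the INSID3 pipeline, whose details are not given in this abstract. It assumes patch features are already available as arrays: the features of the annotated example are averaged inside its mask, and target patches are labeled by thresholded cosine similarity to that average. All function names, the threshold, and the synthetic `toy_features` stand-in for backbone output are hypothetical; no DINOv3 API is called.

```python
import numpy as np


def masked_prototype(ref_feats, ref_mask):
    """Average the reference patch features inside the annotated mask.

    ref_feats: (H, W, C) dense patch features (assumed precomputed).
    ref_mask:  (H, W) boolean mask of the annotated concept.
    """
    proto = ref_feats[ref_mask].mean(axis=0)
    return proto / (np.linalg.norm(proto) + 1e-8)


def cosine_similarity_map(tgt_feats, proto):
    """Cosine similarity between every target patch and the unit prototype."""
    norms = np.linalg.norm(tgt_feats, axis=-1, keepdims=True) + 1e-8
    return (tgt_feats / norms) @ proto


def segment_by_prototype(ref_feats, ref_mask, tgt_feats, threshold=0.5):
    """Label target patches whose similarity to the prototype exceeds a threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    proto = masked_prototype(ref_feats, ref_mask)
    return cosine_similarity_map(tgt_feats, proto) > threshold


def iou(pred, gt):
    """Intersection over union of two boolean masks (averaged over classes gives mIoU)."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0


def toy_features(mask, rng, dim=16, noise=0.05):
    """Synthetic stand-in for backbone features: one direction per region plus noise."""
    fg = np.zeros(dim)
    fg[0] = 1.0
    bg = np.zeros(dim)
    bg[1] = 1.0
    feats = np.where(mask[..., None], fg, bg)
    return feats + noise * rng.standard_normal(feats.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref_mask = np.zeros((8, 8), dtype=bool)
    ref_mask[2:5, 2:5] = True   # annotated concept in the reference image
    tgt_mask = np.zeros((8, 8), dtype=bool)
    tgt_mask[4:8, 0:3] = True   # same concept, different location in the target
    pred = segment_by_prototype(
        toy_features(ref_mask, rng), ref_mask, toy_features(tgt_mask, rng)
    )
    print("IoU on toy data:", iou(pred, tgt_mask))
```

With real backbone features, a single averaged prototype is a crude baseline: it ignores spatial structure and cannot adapt to the granularity of the annotation, which are the properties the abstract attributes to the proposed approach.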
