INSID3: Training-Free In-Context Segmentation with DINOv3
March 30, 2026
Authors: Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Masone, Stefan Roth
cs.AI
Abstract
In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given a single annotated visual example. Existing work either (i) fine-tunes vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities from frozen DINOv3 features alone, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU, while using 3x fewer parameters and without any mask or category-level supervision. Code is available at https://github.com/visinf/INSID3.
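The abstract's core premise, that dense features from a frozen self-supervised backbone already encode enough semantic correspondence to match a concept from one annotated example, can be illustrated with a toy prototype-matching baseline. The sketch below is not the INSID3 pipeline; it uses synthetic random vectors in place of DINOv3 feature maps, and the function name and thresholding scheme are illustrative assumptions.

```python
import numpy as np

def in_context_segment(ref_feats, ref_mask, query_feats, thresh=0.5):
    """Toy training-free in-context segmentation via dense feature
    correspondence. `ref_feats` and `query_feats` are (H, W, C) dense
    feature maps (e.g., from a frozen backbone); `ref_mask` is a binary
    (H, W) mask of the concept in the reference image."""
    # L2-normalize features so dot products become cosine similarities.
    ref = ref_feats / (np.linalg.norm(ref_feats, axis=-1, keepdims=True) + 1e-8)
    qry = query_feats / (np.linalg.norm(query_feats, axis=-1, keepdims=True) + 1e-8)
    # Concept prototype: mean normalized feature under the reference mask.
    proto = ref[ref_mask.astype(bool)].mean(axis=0)
    proto /= np.linalg.norm(proto) + 1e-8
    # Cosine similarity of every query location to the prototype.
    return (qry @ proto) > thresh

# Synthetic demo: random vectors stand in for frozen backbone features.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 64
concept, background = rng.normal(size=C), rng.normal(size=C)

def make_feats(mask):
    # Background direction everywhere, concept direction under the mask.
    feats = np.tile(background, (H, W, 1)) + 0.1 * rng.normal(size=(H, W, C))
    feats[mask] = concept + 0.1 * rng.normal(size=(mask.sum(), C))
    return feats

ref_mask = np.zeros((H, W), bool); ref_mask[2:6, 2:6] = True
query_mask = np.zeros((H, W), bool); query_mask[1:4, 4:8] = True
pred = in_context_segment(make_feats(ref_mask), ref_mask, make_feats(query_mask))
```

Because the concept occupies a distinct feature direction, the predicted mask recovers the concept's (different) location in the query image, which is the essence of one-shot matching without any learned segmentation head.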