INSID3: Training-Free In-Context Segmentation with DINOv3
March 30, 2026
Authors: Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Masone, Stefan Roth
cs.AI
Abstract
In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given a single annotated visual example. Existing work either (i) fine-tunes vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities from frozen DINOv3 features alone, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU, while using 3x fewer parameters and without any mask or category-level supervision. Code is available at https://github.com/visinf/INSID3.
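The abstract's core premise, that dense features from a frozen self-supervised backbone already encode enough semantic correspondence to match a concept from one annotated example, can be illustrated with a toy prototype-matching baseline. The sketch below is not the INSID3 pipeline; it uses synthetic random vectors in place of DINOv3 feature maps, and the function name and thresholding scheme are illustrative assumptions.

```python
import numpy as np

def in_context_segment(ref_feats, ref_mask, query_feats, thresh=0.5):
    """Toy training-free in-context segmentation via dense feature
    correspondence. `ref_feats` and `query_feats` are (H, W, C) dense
    feature maps (e.g., from a frozen backbone); `ref_mask` is a binary
    (H, W) mask of the concept in the reference image."""
    # L2-normalize features so dot products become cosine similarities.
    ref = ref_feats / (np.linalg.norm(ref_feats, axis=-1, keepdims=True) + 1e-8)
    qry = query_feats / (np.linalg.norm(query_feats, axis=-1, keepdims=True) + 1e-8)
    # Concept prototype: mean normalized feature under the reference mask.
    proto = ref[ref_mask.astype(bool)].mean(axis=0)
    proto /= np.linalg.norm(proto) + 1e-8
    # Cosine similarity of every query location to the prototype.
    return (qry @ proto) > thresh

# Synthetic demo: random vectors stand in for frozen backbone features.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 64
concept, background = rng.normal(size=C), rng.normal(size=C)

def make_feats(mask):
    # Background direction everywhere, concept direction under the mask.
    feats = np.tile(background, (H, W, 1)) + 0.1 * rng.normal(size=(H, W, C))
    feats[mask] = concept + 0.1 * rng.normal(size=(mask.sum(), C))
    return feats

ref_mask = np.zeros((H, W), bool); ref_mask[2:6, 2:6] = True
query_mask = np.zeros((H, W), bool); query_mask[1:4, 4:8] = True
pred = in_context_segment(make_feats(ref_mask), ref_mask, make_feats(query_mask))
```

Because the concept occupies a distinct feature direction, the predicted mask recovers the concept's (different) location in the query image, which is the essence of one-shot matching without any learned segmentation head.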