ChatPaper.ai


INSID3: Training-Free In-Context Segmentation with DINOv3

March 30, 2026
作者: Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Masone, Stefan Roth
cs.AI

Abstract

In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given one annotated visual example. Existing work either (i) fine-tunes vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. Building on this, we introduce INSID3, a training-free approach that, given an in-context example, segments concepts at varying granularities from frozen DINOv3 features alone. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5% mIoU while using 3× fewer parameters and no mask or category-level supervision. Code is available at https://github.com/visinf/INSID3.
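The core idea the abstract describes, matching dense features of a query image against an annotated reference, can be sketched in a few lines. This is a hedged illustration, not the actual INSID3 pipeline: the feature matrices stand in for frozen DINOv3 patch embeddings (here replaced by synthetic data), and the mean-prototype matching and the `threshold` value are simplifying assumptions of this sketch.

```python
import numpy as np

def cosine_match(ref_feats, ref_mask, query_feats, threshold=0.0):
    """Minimal in-context matching sketch on frozen dense features.

    ref_feats:   (N, D) patch features of the annotated reference image
    ref_mask:    (N,) boolean mask marking reference patches of the concept
    query_feats: (M, D) patch features of the query image
    Returns a boolean (M,) mask of query patches assigned to the concept.
    """
    # Prototype: mean feature over the annotated reference patches
    proto = ref_feats[ref_mask].mean(axis=0)
    proto = proto / np.linalg.norm(proto)
    # Cosine similarity of every query patch to the prototype
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sim = q @ proto
    return sim > threshold

# Toy example with synthetic stand-ins for dense features
rng = np.random.default_rng(0)
fg = rng.normal(loc=2.0, size=(10, 8))   # "foreground"-like features
bg = rng.normal(loc=-2.0, size=(10, 8))  # "background"-like features
ref_feats = np.vstack([fg[:5], bg[:5]])
ref_mask = np.array([True] * 5 + [False] * 5)
query_feats = np.vstack([fg[5:], bg[5:]])

pred = cosine_match(ref_feats, ref_mask, query_feats)
print(pred)  # first five patches match the concept, last five do not
```

In a real setting, `ref_feats` and `query_feats` would come from a frozen backbone (e.g. via the DINOv3 repository linked above), and the paper's method handles varying granularities rather than a single global prototype.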
April 1, 2026