arXiv: 2511.13714v1
UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity
November 17, 2025
Authors: Junwei Yu, Trevor Darrell, XuDong Wang
cs.CV, cs.AI, cs.LG
Abstract
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
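To make the core idea in the abstract concrete, below is a minimal, hypothetical sketch of how a continuous granularity value could be mapped to an embedding and appended to SAM-2-style prompt tokens before mask decoding. The module name, tensor shapes, MLP design, and the convention that granularity lies in [0, 1] are all assumptions for illustration; they are not the paper's actual implementation, though the tiny parameter count is in the spirit of the reported ~0.02% overhead.

```python
# Hypothetical sketch: a lightweight granularity-control embedding.
# Idea (at the level described in the abstract): map a continuous
# granularity value g in [0, 1] to a small embedding and add it to the
# prompt tokens fed to a frozen SAM-2-style mask decoder. All names,
# shapes, and the MLP design here are illustrative assumptions.

import torch
import torch.nn as nn


class GranularityEmbedding(nn.Module):
    """Maps a scalar granularity g in [0, 1] to a prompt-token embedding."""

    def __init__(self, embed_dim: int = 256, hidden_dim: int = 64):
        super().__init__()
        # A tiny MLP keeps the added parameter count negligible relative
        # to the frozen backbone (the abstract cites ~0.02% extra params).
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, granularity: torch.Tensor) -> torch.Tensor:
        # granularity: (B,) values in [0, 1];
        # 0 = coarsest, 1 = finest (assumed convention).
        return self.mlp(granularity.unsqueeze(-1))  # (B, embed_dim)


if __name__ == "__main__":
    embed_dim = 256                       # assumed prompt-token width
    gran_embed = GranularityEmbedding(embed_dim)

    # Stand-in prompt tokens from a click prompt: (batch, num_tokens, dim).
    prompt_tokens = torch.randn(2, 3, embed_dim)
    g = torch.tensor([0.2, 0.9])          # request a coarse vs. a fine mask

    # Append the granularity token so the decoder can condition on scale.
    gran_token = gran_embed(g).unsqueeze(1)              # (2, 1, 256)
    conditioned = torch.cat([prompt_tokens, gran_token], dim=1)
    print(conditioned.shape)              # torch.Size([2, 4, 256])
```

In this sketch, varying g from 0.2 to 0.9 would simply change the extra token the decoder sees; the abstract's claim is that such a small, continuous control signal, trained self-supervised on discovered mask-granularity pairs, is enough to steer segmentation scale without retraining the backbone.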