arXiv: 2511.13714v1


UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity

November 17, 2025
Authors: Junwei Yu, Trevor Darrell, XuDong Wang
cs.CV cs.AI cs.LG

Abstract

The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
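The abstract describes a lightweight granularity control embedding added on top of SAM-2 and trained self-supervised on discovered mask-granularity pairs, but it does not spell out the implementation. Below is a minimal, hypothetical PyTorch sketch of one way such a control signal could be injected: a scalar granularity value is mapped through a tiny MLP to an extra prompt token that a SAM-style mask decoder attends to alongside point or box tokens. All identifiers here (`GranularityEmbedding`, the dimensions, the coarse-to-fine convention) are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn


class GranularityEmbedding(nn.Module):
    """Hypothetical sketch: map a scalar granularity g in [0, 1] to a
    prompt-token embedding for a SAM-style mask decoder. Layer names,
    sizes, and the g-to-scale convention are illustrative assumptions."""

    def __init__(self, embed_dim: int = 256, hidden_dim: int = 64):
        super().__init__()
        # A tiny MLP keeps the added parameter count negligible relative
        # to a frozen SAM-2 backbone (the abstract cites ~0.02% extra).
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, granularity: torch.Tensor) -> torch.Tensor:
        # granularity: shape (B,), values in [0, 1]; here 0 ~ coarse
        # whole-object masks, 1 ~ fine part-level masks.
        g = granularity.view(-1, 1)
        return self.mlp(g).unsqueeze(1)  # (B, 1, embed_dim) prompt token


# Usage: append the granularity token to the sparse prompt embeddings
# (e.g., point/box tokens) before running the mask decoder, so the same
# click can be steered continuously between coarse and fine masks.
if __name__ == "__main__":
    embed = GranularityEmbedding(embed_dim=256)
    point_tokens = torch.randn(2, 3, 256)          # 2 images, 3 point prompts
    g_token = embed(torch.tensor([0.2, 0.9]))      # coarse vs. fine request
    prompt_tokens = torch.cat([point_tokens, g_token], dim=1)  # (2, 4, 256)
    print(prompt_tokens.shape)
```

An adapter of this size is consistent with the abstract's claim of only ~0.02% additional parameters, since the image encoder and mask decoder would stay frozen and only the small control MLP is learned.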