UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity

Abstract

The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually, by adding more prompts or selecting from pre-generated masks, to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
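The two interactive metrics quoted in the abstract can be made concrete with a short sketch. It assumes the definitions commonly used in the interactive segmentation literature, which the abstract itself does not spell out: 1-IoU is the IoU between the predicted and ground-truth mask after a single click, and $\text{NoC}_{90}$ is the number of clicks needed to first reach 90% IoU, capped at a maximum click budget. The function names and the budget of 20 clicks are illustrative assumptions, not taken from the UnSAMv2 code.

```python
import numpy as np


def iou(pred, gt):
    """Intersection over union of two boolean masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: treated as perfect agreement (a convention, assumed here).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def one_click_iou(ious_per_click):
    """1-IoU under the assumed definition: the IoU reached after the first click."""
    return ious_per_click[0]


def noc_at(ious_per_click, threshold=0.90, max_clicks=20):
    """Clicks needed to first reach `threshold` IoU, capped at `max_clicks` (assumed budget)."""
    for k, value in enumerate(ious_per_click[:max_clicks], start=1):
        if value >= threshold:
            return k
    return max_clicks


# Toy example: ground truth covers the top half of a 4x4 image,
# the prediction covers only the top-left quarter.
gt = np.zeros((4, 4), dtype=bool)
gt[:2, :] = True
pred = np.zeros((4, 4), dtype=bool)
pred[:2, :2] = True
toy_iou = iou(pred, gt)

# Hypothetical per-click IoU trace for one object.
trace = [0.5, 0.8, 0.92]
toy_noc = noc_at(trace)
```

Under these definitions a lower $\text{NoC}_{90}$ and a higher 1-IoU both mean that fewer user interactions are needed to reach the intended mask. Benchmark numbers are averages of such per-object values.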
