Personalize Segment Anything Model with One Shot

Abstract

Driven by large-data pre-training, the Segment Anything Model (SAM) has proven to be a powerful, promptable framework that has revolutionized segmentation models. Despite this generality, customizing SAM for specific visual concepts without manual prompting remains underexplored, e.g., automatically segmenting your pet dog in different images. In this paper, we propose a training-free personalization approach for SAM, termed PerSAM. Given only a single image with a reference mask, PerSAM first localizes the target concept with a location prior, and then segments it in other images or videos via three techniques: target-guided attention, target-semantic prompting, and cascaded post-refinement. In this way, we effectively adapt SAM for private use without any training. To further alleviate mask ambiguity, we present an efficient one-shot fine-tuning variant, PerSAM-F. Freezing the entire SAM, we introduce two learnable weights for multi-scale masks, training only 2 parameters within 10 seconds for improved performance. To demonstrate our efficacy, we construct a new segmentation dataset, PerSeg, for personalized evaluation, and test our methods on video object segmentation with competitive performance. In addition, our approach can enhance DreamBooth to personalize Stable Diffusion for text-to-image generation by discarding background disturbance for better target appearance learning. Code is released at https://github.com/ZrrSkywalker/Personalize-SAM.
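As an illustration of the PerSAM-F idea of "two learnable weights for multi-scale masks", the following is a minimal sketch and not the released implementation. It assumes three candidate masks at different scales and assumes the third weight is tied to the other two as 1 - w1 - w2, so that only two scalars are free. The names `combine_masks` and `binarize` and the 0.5 threshold are hypothetical, chosen only for this example.

```python
from typing import List

Mask = List[List[float]]  # 2D grid of per-pixel mask scores


def combine_masks(masks: List[Mask], w1: float, w2: float) -> Mask:
    """Blend three candidate masks using two free weights.

    Assumption: the third weight is 1 - w1 - w2, so only w1 and w2
    would need to be learned while everything else stays frozen.
    """
    if len(masks) != 3:
        raise ValueError("expected exactly three candidate masks")
    m1, m2, m3 = masks
    w3 = 1.0 - w1 - w2
    return [
        [w1 * a + w2 * b + w3 * c for a, b, c in zip(r1, r2, r3)]
        for r1, r2, r3 in zip(m1, m2, m3)
    ]


def binarize(mask: Mask, threshold: float = 0.5) -> List[List[int]]:
    """Turn blended scores into a hard 0/1 mask (threshold is an assumption)."""
    return [[1 if v > threshold else 0 for v in row] for row in mask]


if __name__ == "__main__":
    # Toy 1x2 masks standing in for three scales of the same object.
    candidates = [[[1.0, 0.0]], [[1.0, 1.0]], [[0.0, 1.0]]]
    blended = combine_masks(candidates, w1=0.5, w2=0.25)
    print(blended, binarize(blended))
```

In a one-shot fine-tuning setting, `w1` and `w2` would be fitted against the single reference mask; the optimization loop is omitted here because the abstract does not describe it.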
