SLiMe: Segment Like Me

Abstract

Large vision-language models such as Stable Diffusion (SD) have enabled significant progress on a variety of downstream tasks, including image editing, image correspondence, and 3D shape generation. Inspired by these advances, we propose SLiMe, which leverages such models to segment images at any desired granularity using as few as one annotated sample. SLiMe frames this problem as an optimization task. Specifically, given a single training image and its segmentation mask, we first extract attention maps, including our novel "weighted accumulated self-attention map," from the SD prior. Then, using the extracted attention maps, we optimize the text embeddings of Stable Diffusion so that each embedding learns about a single segmented region of the training image. The learned embeddings highlight the segmented regions in the attention maps, which can in turn be used to derive the segmentation map. This enables SLiMe to segment any real-world image at inference time with the granularity of the segmented regions in the training image, using just one example. Moreover, leveraging additional training data when available (i.e., the few-shot setting) improves SLiMe's performance. We carried out a knowledge-rich set of experiments examining various design factors and showed that SLiMe outperforms existing one-shot and few-shot segmentation methods.
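The optimization idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes random per-pixel feature vectors as stand-ins for Stable Diffusion's internal features, a plain softmax over pixel-embedding dot products as a stand-in for cross-attention, and a cross-entropy loss against the mask. It omits the weighted accumulated self-attention map entirely. All function names and parameters here are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def make_toy_image(rng, n=400, d=8, num_classes=3, noise=0.3):
    """Hypothetical stand-in for frozen per-pixel features plus a segmentation mask."""
    centers = 3.0 * np.eye(num_classes, d)        # one feature cluster per region
    mask = rng.integers(0, num_classes, size=n)   # integer region label per pixel
    feats = centers[mask] + noise * rng.standard_normal((n, d))
    return feats, mask

def fit_embeddings(feats, mask, num_classes, steps=300, lr=0.5, seed=0):
    """Optimize one embedding per region; the features themselves stay frozen."""
    rng = np.random.default_rng(seed)
    emb = 0.01 * rng.standard_normal((num_classes, feats.shape[1]))
    onehot = np.eye(num_classes)[mask]
    for _ in range(steps):
        attn = softmax(feats @ emb.T, axis=1)            # (pixels, regions) toy attention
        grad = (attn - onehot).T @ feats / len(feats)    # gradient of mean cross-entropy
        emb -= lr * grad
    return emb

def segment(feats, emb):
    """Assign each pixel to the region whose embedding attends to it most."""
    return softmax(feats @ emb.T, axis=1).argmax(axis=1)

rng = np.random.default_rng(0)
train_feats, train_mask = make_toy_image(rng)    # the single annotated example
emb = fit_embeddings(train_feats, train_mask, num_classes=3)
test_feats, test_mask = make_toy_image(rng)      # an unseen toy "image"
accuracy = (segment(test_feats, emb) == test_mask).mean()
print(f"toy one-shot pixel accuracy: {accuracy:.3f}")
```

Only the embeddings are updated, which mirrors the abstract's description of optimizing text embeddings against a frozen prior; everything else about the setup is a simplification chosen to keep the example self-contained.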
