SLiMe: Segment Like Me
September 6, 2023
Authors: Aliasghar Khani, Saeid Asgari Taghanaki, Aditya Sanghi, Ali Mahdavi Amiri, Ghassan Hamarneh
cs.AI
Abstract
Significant strides have been made using large vision-language models, like Stable Diffusion (SD), for a variety of downstream tasks, including image editing, image correspondence, and 3D shape generation. Inspired by these advancements, we explore leveraging these extensive vision-language models for segmenting images at any desired granularity using as few as one annotated sample by proposing SLiMe. SLiMe frames this problem as an optimization task. Specifically, given a single training image and its segmentation mask, we first extract attention maps, including our novel "weighted accumulated self-attention map", from the SD prior. Then, using the extracted attention maps, the text embeddings of Stable Diffusion are optimized such that each of them learns a single segmented region from the training image. These learned embeddings then highlight the segmented regions in the attention maps, which can in turn be used to derive the segmentation map. As a result, SLiMe can segment any real-world image during inference at the granularity of the segmented regions in the training image, using just one example. Moreover, leveraging additional training data when available, i.e., the few-shot setting, further improves SLiMe's performance. We carried out an extensive set of experiments examining various design factors and showed that SLiMe outperforms other existing one-shot and few-shot segmentation methods.
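
As a rough illustration of the optimization the abstract describes, the sketch below fits one learnable text embedding per segmentation class so that its Stable Diffusion cross-attention map aligns with a single image–mask pair. It assumes Hugging Face diffusers-style SD components; `load_sd_components` and `extract_cross_attention` are hypothetical placeholders, and the paper's additional loss terms (notably the weighted accumulated self-attention map) are omitted, so this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical helpers for this sketch (not a real package): they stand in for
# loading diffusers-style SD components and for reading out the cross-attention
# probabilities of the learnable tokens from a UNet forward pass.
from sd_attention_utils import load_sd_components, extract_cross_attention  # hypothetical

unet, vae, scheduler = load_sd_components()  # assumed to return SD v1.x modules

# Placeholder single training example: image in [-1, 1], mask with K class ids.
image = torch.zeros(1, 3, 512, 512)
mask = torch.randint(0, 4, (1, 512, 512))
K = int(mask.max()) + 1

# One learnable text embedding per segmentation class (768-dim in SD v1.x).
text_embeddings = torch.randn(1, K, 768, requires_grad=True)
optimizer = torch.optim.Adam([text_embeddings], lr=0.1)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.mean * 0.18215  # SD v1 latent scaling

for step in range(200):
    t = torch.randint(0, scheduler.config.num_train_timesteps, (1,))
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # UNet forward pass conditioned on the learnable embeddings, recording one
    # cross-attention map per class token (hypothetical helper, shape (K, h, w)).
    attn = extract_cross_attention(unet, noisy_latents, t, text_embeddings)
    attn = F.interpolate(attn[None], size=mask.shape[-2:], mode="bilinear")[0]

    # Encourage each token's attention map to cover its own segmentation class.
    loss = F.cross_entropy(attn[None], mask.long())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At inference, the same attention extraction is run on an unseen image and the
# per-pixel argmax over the K optimized tokens' attention maps yields the mask.
```

In this simplified view, the cross-entropy loss plays the role of pulling each token's attention toward its assigned region; at test time no further optimization is needed, which is why a single annotated example suffices.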