SLiMe: Segment Like Me
September 6, 2023
Authors: Aliasghar Khani, Saeid Asgari Taghanaki, Aditya Sanghi, Ali Mahdavi Amiri, Ghassan Hamarneh
cs.AI
Abstract
Significant strides have been made using large vision-language models, like
Stable Diffusion (SD), for a variety of downstream tasks, including image
editing, image correspondence, and 3D shape generation. Inspired by these
advancements, we explore leveraging these extensive vision-language models for
segmenting images at any desired granularity using as few as one annotated
sample by proposing SLiMe. SLiMe frames this problem as an optimization task.
Specifically, given a single training image and its segmentation mask, we first
extract attention maps, including our novel "weighted accumulated
self-attention map" from the SD prior. Then, using the extracted attention
maps, the text embeddings of Stable Diffusion are optimized such that each of
them learns a single segmented region from the training image. These
learned embeddings then highlight the segmented region in the attention maps,
which in turn can be used to derive the segmentation map. This enables
SLiMe to segment any real-world image during inference with the granularity of
the segmented region in the training image, using just one example. Moreover,
leveraging additional training data when available, i.e., the few-shot setting,
improves the performance of SLiMe. We carried out a comprehensive set of
experiments examining various design factors and showed that SLiMe outperforms
other existing one-shot and few-shot segmentation methods.
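To make the described optimization concrete, below is a minimal PyTorch sketch of the core idea: one learnable text embedding per segmented region, supervised so that its attention map matches the ground-truth mask. It is an illustration only, not the authors' released code; `extract_attention_maps` is a hypothetical helper standing in for the paper's attention extraction from the Stable Diffusion prior (cross-attention plus the weighted accumulated self-attention map), and the embedding size, loss, and optimizer settings are assumptions.

```python
# Sketch of SLiMe-style text-embedding optimization (illustrative only).
import torch
import torch.nn.functional as F

def optimize_text_embeddings(sd_model, image, mask, num_regions,
                             steps=200, lr=0.1):
    """Optimize one text embedding per region so that its attention map
    aligns with that region of the ground-truth segmentation mask."""
    embed_dim = 768  # assumed CLIP text-embedding size for SD v1.x
    text_embeddings = torch.randn(num_regions, embed_dim, requires_grad=True)
    optimizer = torch.optim.Adam([text_embeddings], lr=lr)

    for _ in range(steps):
        # Hypothetical helper: returns per-embedding attention maps of
        # shape (num_regions, H, W) extracted from the SD prior for the
        # (noised) training image conditioned on the embeddings.
        attn_maps = extract_attention_maps(sd_model, image, text_embeddings)

        # Resize attention maps to the mask resolution.
        attn_maps = F.interpolate(attn_maps[None], size=mask.shape[-2:],
                                  mode="bilinear", align_corners=False)[0]

        # Treat per-region attention maps as class logits and supervise
        # them with the integer-valued segmentation mask.
        loss = F.cross_entropy(attn_maps[None], mask[None].long())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return text_embeddings.detach()

def segment(sd_model, image, text_embeddings):
    """Inference sketch: reuse the optimized embeddings on a new image and
    take the argmax over their attention maps as the predicted mask."""
    attn_maps = extract_attention_maps(sd_model, image, text_embeddings)
    return attn_maps.argmax(dim=0)
```

The key design point reflected here is that only the text embeddings are updated; the Stable Diffusion weights stay frozen, which is what lets a single annotated example define the segmentation granularity reused at inference time.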