
EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models

January 22, 2024
Authors: Koichi Namekata, Amirmojtaba Sabour, Sanja Fidler, Seung Wook Kim
cs.AI

Abstract

Diffusion models have recently received increasing research attention for their remarkable transfer abilities in semantic segmentation tasks. However, generating fine-grained segmentation masks with diffusion models often requires additional training on annotated datasets, leaving it unclear to what extent pre-trained diffusion models alone understand the semantic relations of their generated images. To address this question, we leverage the semantic knowledge extracted from Stable Diffusion (SD) and aim to develop an image segmentor capable of generating fine-grained segmentation maps without any additional training. The primary difficulty stems from the fact that semantically meaningful feature maps typically exist only in the spatially lower-dimensional layers, which poses a challenge in directly extracting pixel-level semantic relations from these feature maps. To overcome this issue, our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps by exploiting SD's generation process and utilizes them for constructing image-resolution segmentation maps. In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images, indicating the existence of highly accurate pixel-level semantic knowledge in diffusion models.
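As a rough illustration of the pipeline the abstract describes, the sketch below extracts a spatially low-dimensional feature map from Stable Diffusion's UNet during generation, clusters it into coarse semantic regions, and upsamples the result to image resolution. This is a minimal sketch, not the authors' implementation: the use of the diffusers library, the runwayml/stable-diffusion-v1-5 checkpoint, the choice of mid_block and denoising step, the k-means clustering, and the nearest-neighbor upsampling are all assumptions made here for illustration; EmerDiff instead constructs the pixel-to-feature correspondences by exploiting SD's generation process.

```python
# Minimal sketch (not the authors' implementation), assuming the `diffusers`,
# `torch`, and `scikit-learn` packages and a CUDA device are available.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline
from sklearn.cluster import KMeans

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

captured = {}

def save_features(module, inputs, output):
    # Keep the spatially low-dimensional feature map produced during generation.
    feats = output if torch.is_tensor(output) else output[0]
    captured["feats"] = feats.detach().float()

# Hook a low-resolution UNet block; which block/timestep to read is a design choice.
handle = pipe.unet.mid_block.register_forward_hook(save_features)

image = pipe("a photo of a dog on a sofa", num_inference_steps=30).images[0]
handle.remove()

# Features from the last denoising step; with classifier-free guidance the batch
# holds the unconditional and text-conditional branches, so take the conditional one.
feats = captured["feats"][-1]                    # (C, h, w), e.g. 1280 x 8 x 8
C, h, w = feats.shape
vectors = feats.permute(1, 2, 0).reshape(-1, C).cpu().numpy()

# Cluster feature vectors into k coarse semantic regions (a low-resolution mask).
k = 6
labels = KMeans(n_clusters=k, n_init=10).fit_predict(vectors)
low_res_mask = torch.from_numpy(labels).reshape(1, 1, h, w).float()

# Naive nearest-neighbor upsampling to image resolution; EmerDiff replaces this
# step with pixel-to-feature correspondences derived from SD's generation process.
seg = F.interpolate(low_res_mask, size=image.size[::-1], mode="nearest")
seg = seg.squeeze().long()                       # (H, W) label map
print(seg.shape, seg.unique())
```

In a sketch like this, the block and denoising step used for feature extraction and the number of clusters k are hyperparameters that strongly affect the coarse regions; the paper's contribution is precisely in lifting such low-resolution semantic structure to well-delineated, image-resolution masks without additional training.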