EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models
January 22, 2024
Authors: Koichi Namekata, Amirmojtaba Sabour, Sanja Fidler, Seung Wook Kim
cs.AI
Abstract
Diffusion models have recently received increasing research attention for
their remarkable transfer abilities in semantic segmentation tasks. However,
generating fine-grained segmentation masks with diffusion models often requires
additional training on annotated datasets, leaving it unclear to what extent
pre-trained diffusion models alone understand the semantic relations of their
generated images. To address this question, we leverage the semantic knowledge
extracted from Stable Diffusion (SD) and aim to develop an image segmentor
capable of generating fine-grained segmentation maps without any additional
training. The primary difficulty stems from the fact that semantically
meaningful feature maps typically exist only in the spatially lower-dimensional
layers, which poses a challenge in directly extracting pixel-level semantic
relations from these feature maps. To overcome this issue, our framework
identifies semantic correspondences between image pixels and spatial locations
of low-dimensional feature maps by exploiting SD's generation process and
utilizes them for constructing image-resolution segmentation maps. In extensive
experiments, the produced segmentation maps are demonstrated to be well
delineated and capture detailed parts of the images, indicating the existence
of highly accurate pixel-level semantic knowledge in diffusion models.
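To make the described pipeline concrete, below is a minimal sketch, in Python, of how such a training-free segmentor could be assembled from the abstract's description. It is an illustration, not the paper's exact method: `extract_lowres_features` (hooking a spatially low-dimensional UNet layer during denoising) and `decode_with_modulation` (re-running SD's generation process with locally perturbed features) are hypothetical stand-ins for SD internals; only the k-means clustering step is a real library call (scikit-learn).

```python
# Sketch of a training-free segmentor in the spirit of the abstract.
# `extract_lowres_features` and `decode_with_modulation` are HYPOTHETICAL
# helpers standing in for Stable Diffusion internals.

import numpy as np
import torch
from sklearn.cluster import KMeans


def build_segmentation_map(image: torch.Tensor, num_segments: int = 8) -> np.ndarray:
    # 1. Grab a semantically meaningful but spatially low-dimensional
    #    feature map (e.g. 16x16xC) from an intermediate UNet layer.
    feats = extract_lowres_features(image)              # hypothetical helper
    h, w, c = feats.shape

    # 2. Coarse segmentation: k-means over the low-resolution features.
    labels = KMeans(n_clusters=num_segments, n_init=10).fit_predict(
        feats.reshape(-1, c).cpu().numpy()
    )

    # 3. Pixel-level correspondence: perturb the features of one coarse
    #    segment at a time, re-generate, and record which image pixels
    #    change. This exploits SD's generation process to link each pixel
    #    to a spatial location of the low-dimensional feature map.
    base = decode_with_modulation(image, feats, mask=None)   # hypothetical
    H, W = base.shape[-2:]
    response = np.zeros((num_segments, H, W))
    for k in range(num_segments):
        mask = torch.from_numpy(labels == k).reshape(h, w)
        perturbed = decode_with_modulation(image, feats, mask=mask)
        response[k] = (perturbed - base).abs().sum(dim=0).cpu().numpy()

    # 4. Assign every pixel to the segment that influences it most,
    #    yielding an image-resolution segmentation map.
    return response.argmax(axis=0)
```

The key step is 3: naively upsampling the 16x16 cluster labels would blur object boundaries, whereas measuring each pixel's response to localized feature perturbations recovers the pixel-to-feature correspondences at image resolution, which is what lets the coarse semantic knowledge surface as a fine-grained segmentation map.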