Localizing and Editing Knowledge in Text-to-Image Generative Models
October 20, 2023
Authors: Samyadeep Basu, Nanxuan Zhao, Vlad Morariu, Soheil Feizi, Varun Manjunatha
cs.AI
Abstract
Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have
achieved unprecedented photorealism, with state-of-the-art FID scores
on MS-COCO and other generation benchmarks. Given a caption, image generation
requires fine-grained knowledge about attributes such as object structure,
style, and viewpoint. Where does this information reside in
text-to-image generative models? In our paper, we tackle this question and
understand how knowledge corresponding to distinct visual attributes is stored
in large-scale text-to-image diffusion models. We adapt Causal Mediation
Analysis for text-to-image models and trace knowledge about distinct visual
attributes to various (causal) components in the (i) UNet and (ii) text-encoder
of the diffusion model. In particular, we show that unlike generative
large-language models, knowledge about different attributes is not localized in
isolated components, but is instead distributed amongst a set of components in
the conditional UNet. These sets of components are often distinct for different
visual attributes. Remarkably, we find that the CLIP text-encoder in public
text-to-image models such as Stable-Diffusion contains only one causal state
across different visual attributes, and this is the first self-attention layer
corresponding to the last subject token of the attribute in the caption. This
is in stark contrast to the causal states in other language models, which are
often the mid-MLP layers. Based on this observation of only one causal state in
the text-encoder, we introduce a fast, data-free model editing method
Diff-QuickFix which can effectively edit concepts in text-to-image models.
Diff-QuickFix can edit (ablate) concepts in under a second with a closed-form
update, providing a significant 1000x speedup over existing fine-tuning-based
editing methods while achieving comparable editing performance.
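
To make the tracing procedure described above concrete, here is a minimal sketch of corrupt-and-restore causal tracing. It is not the paper's implementation: `generate` and `score` are hypothetical stand-ins for a diffusion pipeline in which the attribute token embeddings can be corrupted and a single component's cached clean activation can be patched back in, and for an attribute probe (e.g., CLIP similarity between the image and the attribute phrase).

```python
def causal_trace(generate, score, components):
    """Corrupt-and-restore tracing over named model components.

    generate(corrupt, restore) -> image:
        hypothetical wrapper around a text-to-image pipeline that
        (i) optionally corrupts the attribute token embeddings and
        (ii) optionally patches one component's clean activation
        back into the corrupted run.
    score(image) -> float:
        hypothetical attribute probe (e.g., CLIP similarity between
        the generated image and the attribute phrase).
    components: names of candidate UNet / text-encoder sites.
    """
    clean = score(generate(corrupt=False, restore=None))
    corrupted = score(generate(corrupt=True, restore=None))
    effects = {}
    for name in components:
        patched = score(generate(corrupt=True, restore=name))
        # Indirect effect: how much restoring this single component
        # recovers the attribute relative to the corrupted baseline.
        effects[name] = patched - corrupted
    # Components with large effects are the causal states; per the
    # abstract, these form a distributed set in the conditional UNet
    # but collapse to one self-attention layer in the text-encoder.
    return effects, clean, corrupted
```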
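
The abstract also states that Diff-QuickFix applies its edit as a closed-form update. As one plausible illustration, not necessarily the paper's exact objective, a ridge-regression-style edit of a projection matrix in the causal self-attention layer admits a closed-form solution; the function name, shapes, and the `lam` regularizer below are assumptions made for the sketch.

```python
import numpy as np

def closed_form_edit(W, K_concept, V_target, lam=0.5):
    """Ridge-regression-style closed-form weight edit (illustrative).

    W         : (d_out, d_in) original projection weight in the causal
                self-attention layer of the text-encoder
    K_concept : (d_in, n) embeddings of the concept tokens to ablate
    V_target  : (d_out, n) desired outputs, e.g. the layer's outputs
                for a neutral replacement concept
    lam       : regularizer keeping the edited weight close to W

    Solves   min_W' ||W' K - V||_F^2 + lam * ||W' - W||_F^2,
    whose stationarity condition gives
             W' = (V K^T + lam * W) (K K^T + lam * I)^{-1}.
    """
    d_in = W.shape[1]
    lhs = V_target @ K_concept.T + lam * W
    rhs = K_concept @ K_concept.T + lam * np.eye(d_in)
    return lhs @ np.linalg.inv(rhs)
```

Because the update reduces to a single linear solve in the weight's input dimension, it completes in well under a second on commodity hardware, which is consistent with the roughly 1000x speedup over fine-tuning-based editors that the abstract reports.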