
Localizing and Editing Knowledge in Text-to-Image Generative Models

October 20, 2023
Authors: Samyadeep Basu, Nanxuan Zhao, Vlad Morariu, Soheil Feizi, Varun Manjunatha
cs.AI

Abstract

Text-to-image diffusion models such as Stable-Diffusion and Imagen have achieved unprecedented photorealism, with state-of-the-art FID scores on MS-COCO and other generation benchmarks. Given a caption, image generation requires fine-grained knowledge about attributes such as object structure, style, and viewpoint, amongst others. Where does this information reside in text-to-image generative models? In our paper, we tackle this question and study how knowledge corresponding to distinct visual attributes is stored in large-scale text-to-image diffusion models. We adapt Causal Mediation Analysis to text-to-image models and trace knowledge about distinct visual attributes to various (causal) components in (i) the UNet and (ii) the text-encoder of the diffusion model. In particular, we show that unlike in generative large-language models, knowledge about different attributes is not localized in isolated components, but is instead distributed amongst a set of components in the conditional UNet. These sets of components are often distinct for different visual attributes. Remarkably, we find that the CLIP text-encoder in public text-to-image models such as Stable-Diffusion contains only one causal state across different visual attributes: the first self-attention layer corresponding to the last subject token of the attribute in the caption. This is in stark contrast to the causal states in other language models, which are often the mid-MLP layers. Based on this observation of only one causal state in the text-encoder, we introduce Diff-QuickFix, a fast, data-free model-editing method that can effectively edit concepts in text-to-image models. Diff-QuickFix can edit (ablate) concepts in under a second with a closed-form update, providing a significant 1000x speedup over existing fine-tuning-based editing methods with comparable editing performance.
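The causal tracing behind these findings can be pictured as a three-run patching experiment: a clean run, a run with a corrupted caption embedding, and a corrupted run in which a single component's clean activation is restored. Below is a minimal sketch under assumed interfaces, not the authors' released code: `generate` and `score` are hypothetical stand-ins for image synthesis and an attribute metric (e.g., a CLIP-based score), and for brevity a single forward activation is cached rather than patching separately at every diffusion timestep.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def indirect_effect(model: nn.Module, layer: str,
                    clean_emb: torch.Tensor, corrupt_emb: torch.Tensor,
                    generate, score) -> float:
    """Causal indirect effect of `layer`: how much of the attribute score is
    recovered when the corrupted run is patched with this layer's clean
    activation. `generate(model, emb)` returns an image and `score(img)`
    measures the target visual attribute (both hypothetical callables)."""
    modules = dict(model.named_modules())
    cache = {}

    # Run 1 (clean): cache the layer's output activation.
    h = modules[layer].register_forward_hook(
        lambda mod, inp, out: cache.__setitem__("clean", out))
    generate(model, clean_emb)
    h.remove()

    # Run 2 (corrupted): e.g., Gaussian-noised caption embedding;
    # this establishes the low baseline.
    low = score(generate(model, corrupt_emb))

    # Run 3 (patched): corrupted input, but this one layer is forced back
    # to its clean activation. Returning a non-None value from a forward
    # hook replaces the module's output in PyTorch.
    h = modules[layer].register_forward_hook(
        lambda mod, inp, out: cache["clean"])
    high = score(generate(model, corrupt_emb))
    h.remove()

    return high - low  # large value => the layer is causally important
```

Sweeping `indirect_effect` over the candidate components of the UNet and over each layer of the text encoder yields per-component causal maps: in the UNet, high-effect components form attribute-specific sets, while the text encoder shows a single spike at the first self-attention layer.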
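Because the text encoder has only one causal state, a single linear projection can be edited in closed form rather than by fine-tuning. The sketch below shows a ridge-regression-style update in the spirit of Diff-QuickFix; the exact objective, the layer choice, and all variable names are assumptions for illustration, not the paper's verbatim formulation.

```python
import torch

@torch.no_grad()
def closed_form_edit(W: torch.Tensor, K: torch.Tensor, V: torch.Tensor,
                     lam: float = 0.1) -> torch.Tensor:
    """Solve  min_W'  ||W' K^T - V^T||_F^2 + lam * ||W' - W||_F^2  in closed
    form:  W' = (V^T K + lam * W)(K^T K + lam * I)^{-1}.

    W: (d_out, d_in)  original projection weight, e.g. the output projection
                      of the causal self-attention layer in the text encoder.
    K: (n, d_in)      key embeddings of the concept to ablate.
    V: (n, d_out)     value embeddings of the anchor the keys should map to.
    """
    d_in = W.shape[1]
    A = K.T @ K + lam * torch.eye(d_in, dtype=W.dtype, device=W.device)
    B = V.T @ K + lam * W
    # A is symmetric positive definite, so solve instead of inverting:
    # B @ A^{-1} == solve(A, B^T)^T.
    return torch.linalg.solve(A, B.T).T

# Toy usage with random stand-ins for the real CLIP embeddings:
d = 768                    # CLIP text-encoder hidden size
W = torch.randn(d, d)
K = torch.randn(8, d)      # keys for, e.g., "in the style of Van Gogh"
V = torch.randn(8, d)      # values for an anchor such as "painting"
W_edited = closed_form_edit(W, K, V)
```

Since the edit is a single matrix solve rather than gradient-based fine-tuning, it completes in well under a second, which is consistent with the reported ~1000x speedup over fine-tuning-based editing methods.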