Localizing and Editing Knowledge in Text-to-Image Generative Models

Abstract

Text-to-image diffusion models such as Stable-Diffusion and Imagen have achieved unprecedented photorealism, with state-of-the-art FID scores on MS-COCO and other generation benchmarks. Given a caption, image generation requires fine-grained knowledge about attributes such as object structure, style, and viewpoint, among others. Where does this information reside in text-to-image generative models? In this paper, we tackle this question and study how knowledge corresponding to distinct visual attributes is stored in large-scale text-to-image diffusion models. We adapt Causal Mediation Analysis to text-to-image models and trace knowledge about distinct visual attributes to various (causal) components in (i) the UNet and (ii) the text-encoder of the diffusion model. In particular, we show that, unlike in generative large language models, knowledge about different attributes is not localized in isolated components but is instead distributed among a set of components in the conditional UNet. These sets of components are often distinct for different visual attributes. Remarkably, we find that the CLIP text-encoder in public text-to-image models such as Stable-Diffusion contains only one causal state across different visual attributes: the first self-attention layer corresponding to the last subject token of the attribute in the caption. This is in stark contrast to the causal states in other language models, which are often the mid-MLP layers. Based on this observation of only one causal state in the text-encoder, we introduce Diff-QuickFix, a fast, data-free model editing method that can effectively edit concepts in text-to-image models. Diff-QuickFix can edit (ablate) concepts in under a second with a closed-form update, providing a significant 1000x speedup with editing performance comparable to existing fine-tuning based editing methods.
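
The abstract does not spell out the closed-form update. As a rough illustration only, one common way to obtain an edit of this kind is regularized least squares on a single linear projection: choose new weights that map selected input vectors to chosen target outputs while staying close to the original weights. The sketch below assumes that objective. The function name `closed_form_edit`, the array shapes, and the regularization weight `lam` are hypothetical and are not taken from the paper.

```python
import numpy as np


def closed_form_edit(W_orig, keys, targets, lam=1.0):
    """Return W minimizing sum_i ||W k_i - v_i||^2 + lam * ||W - W_orig||_F^2.

    W_orig : (d_out, d_in) original projection matrix
    keys   : (n, d_in) inputs whose outputs should change
    targets: (n, d_out) desired outputs for those inputs
    lam    : pull toward the original weights (assumed > 0)
    """
    d_in = W_orig.shape[1]
    # Setting the gradient to zero gives
    #   W (lam * I + K^T K) = lam * W_orig + V^T K
    A = lam * np.eye(d_in) + keys.T @ keys
    B = lam * W_orig + targets.T @ keys
    # A is symmetric positive definite, so solve A W^T = B^T
    # rather than forming an explicit inverse.
    return np.linalg.solve(A, B.T).T


# Toy usage with random stand-ins for real embeddings (all values hypothetical).
rng = np.random.default_rng(0)
W_orig = rng.normal(size=(4, 6))
k = rng.normal(size=(1, 6))  # stand-in for the embedding of the concept to ablate
v = rng.normal(size=(1, 4))  # stand-in for the output of a generic replacement
W_new = closed_form_edit(W_orig, k, v, lam=1e-3)
```

With a small `lam` the edited matrix sends `k` almost exactly to `v`, while inputs orthogonal to `k` keep their original outputs. With a large `lam` the matrix stays close to `W_orig`. No training data or gradient steps are involved, which is why an update of this form can run in well under a second.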
