Localizing and Editing Knowledge in Text-to-Image Generative Models

Abstract

Text-to-image diffusion models such as Stable-Diffusion and Imagen have achieved unprecedented photorealism, with state-of-the-art FID (Fréchet Inception Distance) scores on MS-COCO and other generation benchmarks. Given a caption, image generation requires fine-grained knowledge about attributes such as object structure, style, and viewpoint. Where does this information reside in text-to-image generative models? In this paper, we address this question by investigating how knowledge corresponding to distinct visual attributes is stored in large-scale text-to-image diffusion models. We adapt Causal Mediation Analysis to text-to-image models and trace knowledge about distinct visual attributes to various (causal) components in (i) the UNet and (ii) the text-encoder of the diffusion model. In particular, we show that, unlike in generative large language models, knowledge about different attributes is not localized in isolated components but is instead distributed among a set of components in the conditional UNet. These sets of components are often distinct for different visual attributes. Remarkably, we find that the CLIP text-encoder in public text-to-image models such as Stable-Diffusion contains only one causal state across different visual attributes: the first self-attention layer corresponding to the last subject token of the attribute in the caption. This is in stark contrast to other language models, in which the causal states are often the mid-MLP layers. Based on this observation of a single causal state in the text-encoder, we introduce Diff-QuickFix, a fast, data-free model editing method that can effectively edit concepts in text-to-image models. Diff-QuickFix can edit (ablate) concepts in under a second with a closed-form update, providing a 1000x speedup with editing performance comparable to existing fine-tuning-based editing methods.
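
The abstract does not spell out the closed-form update itself. As a rough illustration of what such an update can look like, the sketch below assumes a ridge-regression objective on a single projection matrix: find W minimizing sum_i ||W k_i - v_i||^2 + lam * ||W - W0||^2, where k_i are layer inputs for the prompts to be edited and v_i are the desired outputs. The objective, the function name `closed_form_edit`, and the parameters `keys`, `targets`, and `lam` are assumptions made for illustration, not the paper's stated formulation.

```python
import numpy as np


def closed_form_edit(W0, keys, targets, lam=1e-2):
    """Illustrative ridge-style edit of one projection matrix (assumed objective).

    Minimizes sum_i ||W k_i - v_i||^2 + lam * ||W - W0||_F^2 over W.

    W0:      (d_out, d_in) original projection matrix
    keys:    (n, d_in) layer inputs k_i for the prompts being edited
    targets: (n, d_out) desired layer outputs v_i for those inputs
    lam:     strength of the penalty that keeps W close to W0 (must be > 0)
    """
    d_in = W0.shape[1]
    # Setting the gradient to zero gives W (lam*I + K^T K) = lam*W0 + V^T K.
    A = lam * np.eye(d_in) + keys.T @ keys
    B = lam * W0 + targets.T @ keys
    # A is symmetric positive definite for lam > 0, so solve A W^T = B^T
    # rather than forming an explicit inverse.
    return np.linalg.solve(A, B.T).T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W0 = rng.normal(size=(4, 6))       # toy stand-in for a layer's weights
    keys = rng.normal(size=(2, 6))     # toy inputs for the edited concept
    targets = rng.normal(size=(2, 4))  # toy outputs the edit should produce
    W = closed_form_edit(W0, keys, targets, lam=1e-3)
    print("fit error before:", np.linalg.norm(W0 @ keys.T - targets.T))
    print("fit error after: ", np.linalg.norm(W @ keys.T - targets.T))
```

Under this assumed objective, the whole edit is one linear solve, with no training data and no gradient steps, which is consistent with the sub-second cost described above. Inputs orthogonal to every k_i are mapped exactly as before, so the change is confined to the directions spanned by the edited prompts.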
