The Hidden Language of Diffusion Models
June 1, 2023
Authors: Hila Chefer, Oran Lang, Mor Geva, Volodymyr Polosukhin, Assaf Shocher, Michal Irani, Inbar Mosseri, Lior Wolf
cs.AI
Abstract
Text-to-image diffusion models have demonstrated an unparalleled ability to
generate high-quality, diverse images from a textual concept (e.g., "a doctor",
"love"). However, the internal process of mapping text to a rich visual
representation remains an enigma. In this work, we tackle the challenge of
understanding concept representations in text-to-image models by decomposing an
input text prompt into a small set of interpretable elements. This is achieved
by learning a pseudo-token that is a sparse weighted combination of tokens from
the model's vocabulary, with the objective of reconstructing the images
generated for the given concept. Applied to the state-of-the-art Stable
Diffusion model, this decomposition reveals non-trivial and surprising
structures in the representations of concepts. For example, we find that some
concepts such as "a president" or "a composer" are dominated by specific
instances (e.g., "Obama", "Biden") and their interpolations. Other concepts,
such as "happiness", combine associated terms that can be concrete ("family",
"laughter") or abstract ("friendship", "emotion"). In addition to peering into
the inner workings of Stable Diffusion, our method also enables applications
such as single-image decomposition to tokens, bias detection and mitigation,
and semantic image manipulation. Our code will be available at:
https://hila-chefer.github.io/Conceptor/
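The core idea of the abstract, representing a concept as a sparse weighted combination of vocabulary token embeddings, can be sketched as follows. This is a minimal illustration only: the function name, the top-k sparsification scheme, and the softmax weighting are assumptions for exposition, not the authors' actual Conceptor implementation (which learns the weights by reconstructing generated images).

```python
import numpy as np

def pseudo_token(vocab_emb, logits, top_k=5):
    """Illustrative sketch: build a pseudo-token embedding as a sparse,
    convex combination of vocabulary token embeddings.

    vocab_emb: (V, d) matrix of token embeddings
    logits:    (V,) per-token scores (learnable in the real method)
    top_k:     number of vocabulary tokens kept (sparsity constraint)
    """
    # Keep only the top-k scoring tokens; suppress the rest.
    top = np.argsort(logits)[-top_k:]
    masked = np.full_like(logits, -np.inf)
    masked[top] = logits[top]

    # Softmax over the surviving logits yields nonnegative weights
    # that sum to 1, so the pseudo-token stays in the convex hull
    # of the selected token embeddings.
    w = np.exp(masked - masked[top].max())
    w /= w.sum()

    # The pseudo-token is the weighted sum of the kept embeddings.
    return w @ vocab_emb, w
```

In the paper's setting, the weights would be optimized so that conditioning the diffusion model on the pseudo-token reconstructs images generated for the original concept; here the weights are simply given.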