

Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

March 25, 2024
Authors: Omer Dahary, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
cs.AI

Abstract

Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model's attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
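To make the core idea concrete, below is a minimal sketch of a bounded (masked) self-attention step. It assumes each subject comes with a binary layout mask over the latent tokens and blocks query tokens inside one subject's region from attending to key tokens inside any other subject's region; the function name, tensor shapes, and masks are illustrative assumptions, not the paper's implementation.

```python
# Sketch of bounded self-attention: cross-subject attention is masked out,
# assuming per-subject boolean layout masks over the latent tokens.
import torch
import torch.nn.functional as F

def bounded_self_attention(q, k, v, subject_masks):
    """
    q, k, v:        (batch, tokens, dim) projected latent features.
    subject_masks:  (num_subjects, tokens) boolean masks; subject_masks[i, t]
                    is True if token t lies inside subject i's layout region.
    Tokens of one subject do not attend to tokens of a different subject;
    background tokens are left unrestricted in this toy version.
    """
    b, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / d**0.5          # (b, n, n)

    # Allowed[i, j] == False means query token i may not attend to key token j.
    allowed = torch.ones(n, n, dtype=torch.bool, device=q.device)
    for i, qi_mask in enumerate(subject_masks):
        other = torch.zeros(n, dtype=torch.bool, device=q.device)
        for j, kj_mask in enumerate(subject_masks):
            if j != i:
                other |= kj_mask                       # tokens of other subjects
        allowed[qi_mask] &= ~other                     # block cross-subject keys

    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 16 latent tokens, two subjects in disjoint (hypothetical) regions.
q = k = v = torch.randn(1, 16, 8)
masks = torch.zeros(2, 16, dtype=torch.bool)
masks[0, :6] = True
masks[1, 8:14] = True
out = bounded_self_attention(q, k, v, masks)
print(out.shape)  # torch.Size([1, 16, 8])
```

In this toy setup, masking the attention scores before the softmax is what "bounds" the information flow: features from one subject's region cannot leak into another's during the attention update.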
