

Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

March 25, 2024
Authors: Omer Dahary, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
cs.AI

Abstract

Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model's attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
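The abstract attributes semantic leakage to the attention layers blending visual features of different subjects, and describes Bounded Attention as a training-free bound on that information flow. The sketch below illustrates the core masking idea in NumPy; it is not the paper's implementation, and the `bounded_attention` name, the `subject_ids` per-token region labels, and the convention that id 0 marks background are assumptions made for illustration.

```python
import numpy as np

def bounded_attention(q, k, v, subject_ids):
    """Attention in which each query attends only to keys in its own
    subject region (id > 0) or in the background (id == 0), blocking
    cross-subject feature mixing. Hypothetical sketch, not the paper's code.

    q, k: (n, d) queries/keys; v: (n, dv) values;
    subject_ids: (n,) integer region label per token position.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                        # (n, n) similarities
    same = subject_ids[:, None] == subject_ids[None, :]  # same-region pairs
    to_background = subject_ids[None, :] == 0            # keys in background
    allowed = same | to_background
    scores = np.where(allowed, scores, -np.inf)          # bound the info flow
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # masked softmax
    return weights @ v
```

With two subject regions carrying distinct value features, each region's output depends only on its own values, mirroring how the paper's bound is meant to preserve each subject's individuality during denoising.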
