
UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation

August 7, 2025
作者: Wonjun Kang, Byeongkeun Ahn, Minjae Lee, Kevin Galim, Seunghyuk Oh, Hyung Il Koo, Nam Ik Cho
cs.AI

Abstract

Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at https://github.com/furiosa-ai/uncage.
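The core idea described above — using attention maps to decide which masked tokens to unmask first — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the inputs (`attn_obj_a`, `attn_obj_b`, `confidences`) and the scoring rule are assumptions standing in for the cross-attention maps and decoding confidences of a masked generative transformer.

```python
import numpy as np

def contrastive_unmasking_order(attn_obj_a, attn_obj_b, confidences, k):
    """Choose which k masked image tokens to unmask at this decoding step.

    attn_obj_a, attn_obj_b: hypothetical per-image-token attention weights
    to two object phrases in the prompt (each shape [num_tokens]).
    confidences: the model's per-token prediction confidence.

    Tokens that attend strongly to one object and weakly to the other are
    considered object-specific; unmasking them first is the intuition behind
    prioritizing tokens that "clearly represent individual objects".
    """
    # Contrastive score: how much more a token attends to one object
    # than the other (0 for ambiguous tokens attending to both equally).
    contrast = np.abs(attn_obj_a - attn_obj_b)
    # Combine object-specificity with the model's own confidence.
    scores = confidences * contrast
    # Return the indices of the top-k tokens to unmask this step.
    return np.argsort(-scores)[:k]
```

For example, a token attending almost exclusively to "a red apple" would outrank a token attending equally to both objects in the prompt, even at similar confidence, so object-ambiguous regions are resolved later with more context.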