UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation
August 7, 2025
Authors: Wonjun Kang, Byeongkeun Ahn, Minjae Lee, Kevin Galim, Seunghyuk Oh, Hyung Il Koo, Nam Ik Cho
cs.AI
Abstract
Text-to-image (T2I) generation has been actively studied using Diffusion
Models and Autoregressive Models. Recently, Masked Generative Transformers have
gained attention as an alternative to Autoregressive Models to overcome the
inherent limitations of causal attention and autoregressive decoding through
bidirectional attention and parallel decoding, enabling efficient and
high-quality image generation. However, compositional T2I generation remains
challenging, as even state-of-the-art Diffusion Models often fail to accurately
bind attributes and achieve proper text-image alignment. While Diffusion Models
have been extensively studied for this issue, Masked Generative Transformers
exhibit similar limitations but have not been explored in this context. To
address this, we propose Unmasking with Contrastive Attention Guidance
(UNCAGE), a novel training-free method that improves compositional fidelity by
leveraging attention maps to prioritize the unmasking of tokens that clearly
represent individual objects. UNCAGE consistently improves performance in both
quantitative and qualitative evaluations across multiple benchmarks and
metrics, with negligible inference overhead. Our code is available at
https://github.com/furiosa-ai/uncage.
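
Below is a minimal, hypothetical sketch of the core idea as stated in the abstract: score masked image tokens by how contrastively their cross-attention concentrates on a single object's text tokens, then unmask the highest-scoring tokens first. The function name, tensor shapes, and the `object_token_groups` input are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch

def contrastive_unmasking_scores(cross_attn: torch.Tensor,
                                 object_token_groups: list[list[int]]) -> torch.Tensor:
    """Score each image token by how clearly it attends to ONE object.

    cross_attn: (num_image_tokens, num_text_tokens) cross-attention map,
        e.g. averaged over heads and layers (an assumption of this sketch).
    object_token_groups: text-token indices for each object noun phrase
        in the prompt, e.g. [[2, 3], [6, 7]] for "a red dog and a blue cat".
    Returns: (num_image_tokens,) scores; higher = unmask earlier.
    """
    # Attention mass each image token assigns to each object's text tokens.
    per_object = torch.stack(
        [cross_attn[:, idx].sum(dim=-1) for idx in object_token_groups],
        dim=-1,
    )  # (num_image_tokens, num_objects)

    # A token "clearly represents" an individual object when its attention
    # to its best-matching object dominates its attention to other objects.
    top, _ = per_object.max(dim=-1)
    rest = per_object.sum(dim=-1) - top
    return top - rest  # contrastive score


# Illustrative use inside one parallel-decoding step: among the tokens the
# sampler would reveal, reveal those with the highest contrastive scores.
num_image_tokens, num_text_tokens = 256, 16
attn = torch.rand(num_image_tokens, num_text_tokens).softmax(dim=-1)
scores = contrastive_unmasking_scores(attn, [[2, 3], [6, 7]])
unmask_order = scores.argsort(descending=True)
```

Since scoring reuses attention maps the transformer already computes, a scheme of this shape only reorders the unmasking schedule, which is consistent with the negligible inference overhead the abstract reports.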