UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation
August 7, 2025
Authors: Wonjun Kang, Byeongkeun Ahn, Minjae Lee, Kevin Galim, Seunghyuk Oh, Hyung Il Koo, Nam Ik Cho
cs.AI
Abstract
Text-to-image (T2I) generation has been actively studied using Diffusion
Models and Autoregressive Models. Recently, Masked Generative Transformers have
gained attention as an alternative to Autoregressive Models to overcome the
inherent limitations of causal attention and autoregressive decoding through
bidirectional attention and parallel decoding, enabling efficient and
high-quality image generation. However, compositional T2I generation remains
challenging, as even state-of-the-art Diffusion Models often fail to accurately
bind attributes and achieve proper text-image alignment. While Diffusion Models
have been extensively studied for this issue, Masked Generative Transformers
exhibit similar limitations but have not been explored in this context. To
address this, we propose Unmasking with Contrastive Attention Guidance
(UNCAGE), a novel training-free method that improves compositional fidelity by
leveraging attention maps to prioritize the unmasking of tokens that clearly
represent individual objects. UNCAGE consistently improves performance in both
quantitative and qualitative evaluations across multiple benchmarks and
metrics, with negligible inference overhead. Our code is available at
https://github.com/furiosa-ai/uncage.
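
The abstract's core mechanism is to rank masked image tokens by how unambiguously their attention maps point to a single object phrase, and to unmask the clearest ones first. As a rough, hypothetical sketch of that idea (not the paper's exact formulation: the function name `uncage_unmask_order`, the per-phrase token grouping, and the margin-based contrastive score below are all our assumptions), one unmasking step might look like this in PyTorch:

```python
import torch

def uncage_unmask_order(
    cross_attn: torch.Tensor,              # (num_image_tokens, num_text_tokens) attention weights
    object_token_groups: list[list[int]],  # text-token indices for each object phrase (>= 2 groups)
    masked: torch.Tensor,                  # (num_image_tokens,) bool, True where still masked
    k: int,                                # number of tokens to unmask this step (<= masked.sum())
) -> torch.Tensor:
    """Return the k masked image-token positions whose attention most clearly
    favors one object phrase over the competing ones (a hypothetical score)."""
    # Total attention mass per object phrase: (num_image_tokens, num_objects).
    per_obj = torch.stack(
        [cross_attn[:, idx].sum(dim=-1) for idx in object_token_groups], dim=-1
    )
    # Contrastive margin: best-matching phrase minus the strongest competitor,
    # so tokens with ambiguous (mixed) attention score low and are deferred.
    top2 = per_obj.topk(2, dim=-1).values
    score = top2[:, 0] - top2[:, 1]
    # Only still-masked tokens compete for unmasking.
    score = score.masked_fill(~masked, float("-inf"))
    return score.topk(k).indices

# Toy usage: 16 image tokens, 10 text tokens, two object phrases.
attn = torch.rand(16, 10).softmax(dim=-1)  # stand-in for a real attention map
groups = [[1, 2], [5, 6]]                  # e.g., token ids of "red cube" / "blue ball"
masked = torch.ones(16, dtype=torch.bool)
print(uncage_unmask_order(attn, groups, masked, k=4))
```

The intent is that tokens revealed early each anchor a single object, so later parallel-decoded tokens condition on unambiguous context instead of propagating mixed attributes; the method's actual scoring and scheduling should be taken from the released code at https://github.com/furiosa-ai/uncage.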