UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation
August 7, 2025
Authors: Wonjun Kang, Byeongkeun Ahn, Minjae Lee, Kevin Galim, Seunghyuk Oh, Hyung Il Koo, Nam Ik Cho
cs.AI
Abstract
Text-to-image (T2I) generation has been actively studied using Diffusion
Models and Autoregressive Models. Recently, Masked Generative Transformers have
gained attention as an alternative to Autoregressive Models to overcome the
inherent limitations of causal attention and autoregressive decoding through
bidirectional attention and parallel decoding, enabling efficient and
high-quality image generation. However, compositional T2I generation remains
challenging, as even state-of-the-art Diffusion Models often fail to accurately
bind attributes and achieve proper text-image alignment. While Diffusion Models
have been extensively studied for this issue, Masked Generative Transformers
exhibit similar limitations but have not been explored in this context. To
address this, we propose Unmasking with Contrastive Attention Guidance
(UNCAGE), a novel training-free method that improves compositional fidelity by
leveraging attention maps to prioritize the unmasking of tokens that clearly
represent individual objects. UNCAGE consistently improves performance in both
quantitative and qualitative evaluations across multiple benchmarks and
metrics, with negligible inference overhead. Our code is available at
https://github.com/furiosa-ai/uncage.
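
The abstract's core mechanism is to rank masked image tokens by how unambiguously their attention maps point to a single object phrase, and to unmask the clearest ones first. As a rough, hypothetical sketch of that idea (not the paper's exact formulation: the function name `uncage_unmask_order`, the per-phrase token grouping, and the margin-based contrastive score below are all our assumptions), one unmasking step might look like this in PyTorch:

```python
import torch

def uncage_unmask_order(
    cross_attn: torch.Tensor,              # (num_image_tokens, num_text_tokens) attention weights
    object_token_groups: list[list[int]],  # text-token indices for each object phrase (>= 2 groups)
    masked: torch.Tensor,                  # (num_image_tokens,) bool, True where still masked
    k: int,                                # number of tokens to unmask this step (<= masked.sum())
) -> torch.Tensor:
    """Return the k masked image-token positions whose attention most clearly
    favors one object phrase over the competing ones (a hypothetical score)."""
    # Total attention mass per object phrase: (num_image_tokens, num_objects).
    per_obj = torch.stack(
        [cross_attn[:, idx].sum(dim=-1) for idx in object_token_groups], dim=-1
    )
    # Contrastive margin: best-matching phrase minus the strongest competitor,
    # so tokens with ambiguous (mixed) attention score low and are deferred.
    top2 = per_obj.topk(2, dim=-1).values
    score = top2[:, 0] - top2[:, 1]
    # Only still-masked tokens compete for unmasking.
    score = score.masked_fill(~masked, float("-inf"))
    return score.topk(k).indices

# Toy usage: 16 image tokens, 10 text tokens, two object phrases.
attn = torch.rand(16, 10).softmax(dim=-1)  # stand-in for a real attention map
groups = [[1, 2], [5, 6]]                  # e.g., token ids of "red cube" / "blue ball"
masked = torch.ones(16, dtype=torch.bool)
print(uncage_unmask_order(attn, groups, masked, k=4))
```

The intent is that tokens revealed early each anchor a single object, so later parallel-decoded tokens condition on unambiguous context instead of propagating mixed attributes; the method's actual scoring and scheduling should be taken from the released code at https://github.com/furiosa-ai/uncage.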