UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation

Abstract

Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at https://github.com/furiosa-ai/uncage.
