UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation

Abstract

Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models: bidirectional attention and parallel decoding let them overcome the inherent limitations of causal attention and autoregressive decoding, enabling efficient, high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to bind attributes accurately and achieve proper text-image alignment. While this issue has been studied extensively for Diffusion Models, Masked Generative Transformers exhibit similar limitations yet have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at https://github.com/furiosa-ai/uncage.
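The abstract does not spell out how attention maps are turned into an unmasking priority, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes each object in the prompt has a per-image-token attention map, and that a token "clearly represents" one object when that object's attention exceeds the strongest competing object's attention. The function names, the `weight` parameter, and the additive combination with the model's confidence are all hypothetical choices made for illustration.

```python
def contrastive_scores(attn):
    """Per-token contrastive score (hypothetical formulation).

    attn[i][t] is the attention weight of object i on image token t.
    Requires at least two objects. For each token, return the largest gap
    between one object's attention and the strongest competing object's.
    """
    num_objects = len(attn)
    num_tokens = len(attn[0])
    scores = []
    for t in range(num_tokens):
        best = float("-inf")
        for i in range(num_objects):
            strongest_other = max(
                attn[j][t] for j in range(num_objects) if j != i
            )
            best = max(best, attn[i][t] - strongest_other)
        scores.append(best)
    return scores


def select_tokens_to_unmask(confidence, attn, masked, k, weight=1.0):
    """Pick k masked token positions to unmask at this decoding step.

    Ranks masked positions by model confidence plus a weighted contrastive
    attention score (an assumed way of combining the two signals).
    """
    guidance = contrastive_scores(attn)
    ranked = sorted(
        masked,
        key=lambda t: confidence[t] + weight * guidance[t],
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy example: two objects, four image tokens, all values made up.
attn = [
    [0.9, 0.5, 0.1, 0.2],  # object 0 dominates token 0
    [0.1, 0.5, 0.8, 0.2],  # object 1 dominates token 2
]
confidence = [0.5, 0.6, 0.5, 0.55]
masked = [0, 1, 2, 3]

guided = select_tokens_to_unmask(confidence, attn, masked, k=2)
unguided = select_tokens_to_unmask(confidence, attn, masked, k=2, weight=0.0)
print(guided, unguided)
```

In this toy setup, guidance shifts the selection toward tokens 0 and 2, where one object's attention clearly dominates, whereas confidence alone would pick the ambiguous tokens 1 and 3. Because the extra work is a few comparisons per token on attention maps the model already computes, a scheme of this shape would be consistent with the negligible inference overhead the abstract reports.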
