
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

December 2, 2024
作者: Anton Voronov, Denis Kuznedelev, Mikhail Khoroshikh, Valentin Khrulkov, Dmitry Baranchuk
cs.AI

Abstract

This work presents Switti, a scale-wise transformer for text-to-image generation. Starting from existing next-scale prediction AR models, we first explore them for T2I generation and propose architectural modifications to improve their convergence and overall performance. We then observe that self-attention maps of our pretrained scale-wise AR model exhibit weak dependence on preceding scales. Based on this insight, we propose a non-AR counterpart facilitating ~11% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~20% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7× faster.
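To make the two speedups described in the abstract concrete, below is a minimal sketch of a scale-wise sampling loop. It is not the Switti implementation: the `model`, `vqvae`, `text_emb`, and `null_emb` interfaces and the `guidance_cutoff` threshold are hypothetical stand-ins. Only the structure reflects the abstract: one parallel token prediction per scale (rather than per token), standard classifier-free guidance applied to logits, and guidance disabled at high-resolution scales so the unconditional forward pass is skipped there.

```python
import torch

@torch.no_grad()
def scale_wise_sample(model, vqvae, text_emb, null_emb,
                      scales=(1, 2, 4, 8, 16, 32, 64),
                      cfg_scale=6.0, guidance_cutoff=32):
    """Hypothetical scale-wise sampler (illustrative only).

    At each scale, all s*s tokens are predicted in a single forward pass,
    conditioned on the upsampled canvas from the previous scale rather
    than on the full token history (the non-AR variant described above).
    """
    tokens_per_scale = []
    canvas = None  # running latent canvas; None stands in for a start token
    for s in scales:
        cond_logits = model(canvas, text_emb, scale=s)  # (B, s*s, vocab)
        if s < guidance_cutoff:
            # Standard classifier-free guidance on logits at low scales.
            uncond_logits = model(canvas, null_emb, scale=s)
            logits = uncond_logits + cfg_scale * (cond_logits - uncond_logits)
        else:
            # High-resolution scales skip the unconditional pass entirely,
            # saving one forward pass per scale (the ~20% speedup claimed).
            logits = cond_logits
        # Sample every token of this scale at once.
        probs = torch.softmax(logits, dim=-1)
        idx = torch.multinomial(probs.flatten(0, 1), 1).view(logits.shape[:-1])
        tokens_per_scale.append(idx)
        # Embed the sampled tokens to form the next scale's input canvas.
        canvas = vqvae.embed(idx, scale=s)
    # Decode the multi-scale token pyramid into an image.
    return vqvae.decode(tokens_per_scale)
```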