

Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

December 2, 2024
作者: Anton Voronov, Denis Kuznedelev, Mikhail Khoroshikh, Valentin Khrulkov, Dmitry Baranchuk
cs.AI

Abstract

This work presents Switti, a scale-wise transformer for text-to-image generation. Starting from existing next-scale prediction AR models, we first explore them for T2I generation and propose architectural modifications to improve their convergence and overall performance. We then observe that self-attention maps of our pretrained scale-wise AR model exhibit weak dependence on preceding scales. Based on this insight, we propose a non-AR counterpart facilitating ~11% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~20% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7× faster.
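The guidance trick described above can be sketched in a few lines. This is a minimal illustration, not Switti's actual code: the function names, scale counts, and the scalar stand-in for a model forward pass are all hypothetical. The point is that classifier-free guidance (CFG) needs two forward passes (conditional and unconditional), so skipping the unconditional pass at the fine scales, where most tokens live, saves a large share of sampling compute.

```python
# Hypothetical sketch: scale-wise sampling with classifier-free guidance (CFG)
# disabled at the highest-resolution scales. Names and values are illustrative.

NUM_SCALES = 10      # e.g., token maps from 1x1 up to the final resolution
GUIDED_SCALES = 8    # apply CFG only at the first 8 (coarser) scales
CFG_WEIGHT = 4.0     # guidance strength

def predict_logits(scale: int, conditional: bool) -> float:
    # Stand-in for a model forward pass; returns a scalar "logit" here.
    return float(scale) + (1.0 if conditional else 0.0)

def guided_logits(scale: int) -> float:
    cond = predict_logits(scale, conditional=True)
    if scale >= GUIDED_SCALES:
        # Fine scales: skip the unconditional pass entirely, halving
        # the compute spent at these (most expensive) scales.
        return cond
    uncond = predict_logits(scale, conditional=False)
    # Standard CFG combination: extrapolate away from the unconditional prediction.
    return uncond + CFG_WEIGHT * (cond - uncond)

logits_per_scale = [guided_logits(s) for s in range(NUM_SCALES)]
```

Because the high-resolution scales dominate the token count, dropping their unconditional passes yields a disproportionate speedup relative to the number of scales skipped, which is consistent with the ~20% figure reported above.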