DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
October 10, 2024
Authors: Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Josh Susskind, Shuangfei Zhai
cs.AI
Abstract
Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process that gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully utilize the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model with the same architecture as standard language models. DART does not rely on image quantization, enabling more effective image modeling while maintaining flexibility. Furthermore, DART seamlessly trains with both text and image data in a unified model. Our approach demonstrates competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models. Through this unified framework, DART sets a new benchmark for scalable, high-quality image synthesis.
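
For intuition, below is a minimal sketch of the non-Markovian sampling loop the abstract describes: the model conditions on the entire trajectory of noisy images rather than only the most recent one, which is what distinguishes DART from a Markovian diffusion sampler. The denoise_step stand-in, the step count, and all tensor shapes are illustrative assumptions, not the authors' implementation.

    import torch

    T = 16                              # number of denoising steps (assumed)
    B, C, H, W = 2, 3, 32, 32           # toy image shape (assumed)

    def denoise_step(trajectory: list[torch.Tensor], t: int) -> torch.Tensor:
        # Stand-in for DART's decoder-only AR transformer. In the paper's
        # setup, the model attends over the patches of *every* earlier noisy
        # image, not just the latest one; that full-history conditioning is
        # what makes the process non-Markovian. Here we simply shrink the
        # noise so the loop runs end to end.
        return 0.9 * trajectory[-1]

    x = torch.randn(B, C, H, W)         # start from pure Gaussian noise
    trajectory = [x]                    # keep the whole generation trajectory

    for t in reversed(range(T)):
        x = denoise_step(trajectory, t) # conditions on the full history
        trajectory.append(x)

    sample = trajectory[-1]             # final denoised image

A Markovian sampler would pass only trajectory[-1] to the model at each step; feeding the full list is the structural change the abstract argues lets the model exploit the whole generation trajectory.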