
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

October 10, 2024
作者: Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Josh Susskind, Shuangfei Zhai
cs.AI

Abstract

Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process that gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully utilize the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model with the same architecture as standard language models. DART does not rely on image quantization, enabling more effective image modeling while maintaining flexibility. Furthermore, DART seamlessly trains with both text and image data in a unified model. Our approach demonstrates competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models. Through this unified framework, DART sets a new benchmark for scalable, high-quality image synthesis.
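The core distinction the abstract draws can be illustrated with a toy sketch: a Markovian denoiser conditions only on the current noisy state, whereas a non-Markovian, DART-style autoregressive denoiser conditions on the entire generation trajectory, as a transformer over the sequence of denoising steps would. The function names and the toy "denoise" rule below are illustrative assumptions, not the paper's actual architecture.

```python
# Toy contrast between Markovian and non-Markovian denoising.
# The 0.5-shrinkage "denoise" rule is a stand-in for a learned model.

def markovian_step(x_t):
    # Standard diffusion assumption: the next state depends only on x_t.
    return [v * 0.5 for v in x_t]

def non_markovian_step(trajectory):
    # DART-style assumption: the next state may depend on the whole
    # history (x_T, ..., x_t), here summarized by a per-dimension mean.
    n = len(trajectory)
    mean = [sum(step[i] for step in trajectory) / n
            for i in range(len(trajectory[0]))]
    return [v * 0.5 for v in mean]

def generate(x_T, steps, step_fn, markovian):
    # Run the denoising chain, feeding the step function either the
    # latest state (Markovian) or the full trajectory (non-Markovian).
    trajectory = [x_T]
    for _ in range(steps):
        context = trajectory[-1] if markovian else trajectory
        trajectory.append(step_fn(context))
    return trajectory[-1]
```

The point of the sketch is only that the non-Markovian step function receives strictly more information per step; in DART this extra context is consumed by a language-model-style AR transformer rather than a hand-written rule.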
