D-AR: Diffusion via Autoregressive Models

May 29, 2025
Authors: Ziteng Gao, Mike Zheng Shou
cs.AI

Abstract

This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm that recasts the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing a tokenizer that converts images into sequences of discrete tokens, where tokens at different positions can be decoded into different diffusion denoising steps in pixel space. Thanks to the properties of diffusion, these tokens naturally follow a coarse-to-fine order, which lends itself directly to autoregressive modeling. We therefore apply standard next-token prediction to these tokens without modifying any underlying design (neither the causal mask nor the training and inference strategies), so that sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in a streaming manner. Our pipeline naturally reveals several intriguing properties: for example, it supports consistent previews when only a subset of tokens has been generated, and it enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures for visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR
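
The streaming behaviour described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_next_token` and `toy_denoise_step` are hypothetical stand-ins for the autoregressive model and the diffusion decoder, and the group size, vocabulary size, and update rule are assumptions chosen only to make the control flow runnable. The only figure taken from the abstract is the sequence length of 256 tokens.

```python
import random
from typing import Iterator, List

NUM_TOKENS = 256   # sequence length reported in the abstract
GROUP_SIZE = 32    # assumption: number of tokens consumed per denoising step
VOCAB_SIZE = 1024  # assumption: hypothetical codebook size


def toy_next_token(prefix: List[int], rng: random.Random) -> int:
    """Hypothetical stand-in for the autoregressive model.

    A real model would condition on `prefix`; this toy just samples uniformly.
    """
    return rng.randrange(VOCAB_SIZE)


def toy_denoise_step(x: List[float], group: List[int],
                     step: int, total_steps: int) -> List[float]:
    """Hypothetical stand-in for one denoising step conditioned on a token group."""
    # Fake "clean signal" implied by the tokens (assumption, for illustration).
    target = sum(group) / (len(group) * VOCAB_SIZE)
    # The weight grows with the step index and reaches 1.0 on the final step.
    weight = 1.0 / (total_steps - step)
    return [(1.0 - weight) * v + weight * target for v in x]


def stream_generate(seed: int = 0, pixels: int = 16) -> Iterator[List[float]]:
    """Yield one partially denoised preview after each increment of tokens."""
    rng = random.Random(seed)
    total_steps = NUM_TOKENS // GROUP_SIZE
    x = [rng.gauss(0.0, 1.0) for _ in range(pixels)]  # start from pure noise
    tokens: List[int] = []
    for step in range(total_steps):
        # Ordinary next-token prediction: one token at a time, left to right.
        for _ in range(GROUP_SIZE):
            tokens.append(toy_next_token(tokens, rng))
        # Decode the newest increment of tokens into one denoising step.
        x = toy_denoise_step(x, tokens[-GROUP_SIZE:], step, total_steps)
        yield x


if __name__ == "__main__":
    for i, preview in enumerate(stream_generate()):
        print(f"step {i}: first value {preview[0]:.4f}")
```

Because each preview is produced as soon as its token group is available, a consumer can stop iterating early and still hold a coarse estimate, which is the sense in which partial token sequences yield previews in this sketch.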
