
D-AR: Diffusion via Autoregressive Models

May 29, 2025
Authors: Ziteng Gao, Mike Zheng Shou
cs.AI

Abstract

This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm that recasts the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing a tokenizer that converts images into sequences of discrete tokens, where tokens at different positions can be decoded into different diffusion denoising steps in pixel space. Thanks to the properties of diffusion, these tokens naturally follow a coarse-to-fine order, which lends itself directly to autoregressive modeling. We therefore apply standard next-token prediction to these tokens without modifying any underlying design (neither the causal mask nor the training and inference strategies), so that sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in a streaming manner. Our pipeline naturally reveals several intriguing properties: for example, it supports consistent previews when only a subset of tokens has been generated, and it enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures for visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR
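
The streaming behavior described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: `toy_next_token` and `toy_denoise_step` are hypothetical stand-ins for the autoregressive model and the diffusion decoder, and the vocabulary size, group size, and 16-value "image" are assumptions chosen only for illustration. Only the total of 256 tokens comes from the abstract. The sketch shows the control flow: tokens are sampled one at a time, and each completed group of tokens triggers one denoising step, yielding a preview.

```python
import random
from typing import Iterator, List, Tuple

NUM_TOKENS = 256   # sequence length reported in the abstract
VOCAB_SIZE = 1024  # assumption: not stated in the abstract
GROUP_SIZE = 32    # assumption: tokens consumed per denoising step
IMAGE_SIZE = 16    # toy "image": a flat list of 16 floats


def toy_next_token(prefix: List[int], rng: random.Random) -> int:
    """Hypothetical stand-in for sampling the next token given the prefix."""
    return rng.randrange(VOCAB_SIZE)


def toy_denoise_step(image: List[float], group: List[int]) -> List[float]:
    """Hypothetical stand-in for one denoising step conditioned on a token group.

    Blends the current image halfway toward a target derived from the tokens.
    """
    target = [group[i % len(group)] / VOCAB_SIZE for i in range(len(image))]
    return [0.5 * x + 0.5 * t for x, t in zip(image, target)]


def generate_streaming(seed: int = 0) -> Iterator[Tuple[int, List[int], List[float]]]:
    """Yield (step, tokens_so_far, preview) after each completed token group."""
    rng = random.Random(seed)
    tokens: List[int] = []
    # Start from Gaussian noise, as a diffusion sampler would.
    image = [rng.gauss(0.0, 1.0) for _ in range(IMAGE_SIZE)]
    for step in range(NUM_TOKENS // GROUP_SIZE):
        group = []
        for _ in range(GROUP_SIZE):
            token = toy_next_token(tokens, rng)  # ordinary next-token sampling
            tokens.append(token)
            group.append(token)
        # A finished group is decoded immediately into one denoising step.
        image = toy_denoise_step(image, group)
        yield step, list(tokens), list(image)


if __name__ == "__main__":
    for step, tokens, preview in generate_streaming():
        print(f"step {step}: {len(tokens)} tokens, preview[0]={preview[0]:.3f}")
```

Because each preview depends only on the tokens generated so far, stopping the loop early gives a partial result that later steps refine rather than replace, which is the toy analogue of the consistent previews mentioned above.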
