

DiSA: Diffusion Step Annealing in Autoregressive Image Generation

May 26, 2025
Authors: Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Gould, Liang Zheng
cs.AI

Abstract
An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 diffusion steps to sample a single token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. Intuitively, if a model has generated part of a dog, the remaining tokens must complete the dog and are thus more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on this finding, we introduce diffusion step annealing (DiSA), a training-free method that gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models and, though simple, achieves 5-10× faster inference for MAR and Harmon and 1.4-2.5× for FlowAR and xAR, while maintaining generation quality.
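The abstract notes that DiSA takes only a few lines of code. As an illustration only, the core idea (fewer diffusion steps for later tokens) could be sketched as a schedule function; the function name, the linear shape, and the 64-token sequence length below are assumptions for this sketch, not details from the paper:

```python
def annealed_steps(token_idx: int, total_tokens: int,
                   max_steps: int = 50, min_steps: int = 5) -> int:
    """Return the number of diffusion steps for the token at position
    token_idx, linearly annealed from max_steps down to min_steps.

    Hypothetical schedule illustrating the DiSA idea: early tokens get
    many diffusion steps; later, more constrained tokens get fewer.
    """
    frac = token_idx / max(total_tokens - 1, 1)  # 0.0 at start, 1.0 at end
    steps = round(max_steps - frac * (max_steps - min_steps))
    return max(min_steps, steps)

# Example: a 64-token generation whose step budget shrinks from 50 to 5.
schedule = [annealed_steps(i, 64) for i in range(64)]
print(schedule[0], schedule[-1])  # 50 5
```

At sampling time, the per-token diffusion loop would simply use `annealed_steps(i, total_tokens)` iterations instead of a fixed count; the paper's actual schedules (e.g., for FlowAR and xAR, where the speedup is smaller) may differ from this linear form.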

