DiSA: Diffusion Step Annealing in Autoregressive Image Generation
May 26, 2025
Authors: Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Gould, Liang Zheng
cs.AI
Abstract
An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and
Harmon, adopt diffusion sampling to improve the quality of image generation.
However, this strategy leads to low inference efficiency, because it usually
takes 50 to 100 steps for diffusion to sample a token. This paper explores how
to effectively address this issue. Our key motivation is that as more tokens
are generated during the autoregressive process, subsequent tokens follow more
constrained distributions and are easier to sample. To intuitively explain, if
a model has generated part of a dog, the remaining tokens must complete the dog
and thus are more constrained. Empirical evidence supports our motivation: at
later generation stages, the next tokens can be well predicted by a multilayer
perceptron, exhibit low variance, and follow closer-to-straight-line denoising
paths from noise to tokens. Based on our finding, we introduce diffusion step
annealing (DiSA), a training-free method which gradually uses fewer diffusion
steps as more tokens are generated, e.g., using 50 steps at the beginning and
gradually decreasing to 5 steps at later stages. Because DiSA is derived from
our finding specific to diffusion in autoregressive models, it is complementary
to existing acceleration methods designed for diffusion alone. DiSA can be
implemented in only a few lines of code on existing models, and albeit simple,
achieves 5-10× faster inference for MAR and Harmon and 1.4-2.5× faster
inference for FlowAR and xAR, while maintaining generation quality.
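The abstract notes that DiSA amounts to only a few lines of code: as the autoregressive model emits more tokens, the per-token diffusion sampler is given fewer steps. The sketch below illustrates one plausible schedule, a linear anneal from 50 steps down to 5; the function name and the linear shape are illustrative assumptions, as the paper may use a different annealing curve.

```python
# Hypothetical sketch of a diffusion-step annealing schedule in the spirit
# of DiSA. The exact schedule in the paper may differ; this only shows the
# idea of spending fewer diffusion steps on later tokens.

def annealed_steps(tokens_done: int, total_tokens: int,
                   max_steps: int = 50, min_steps: int = 5) -> int:
    """Linearly anneal the diffusion step count from max_steps to min_steps
    as generation progresses (tokens_done goes from 0 to total_tokens - 1)."""
    frac = tokens_done / max(total_tokens - 1, 1)  # 0.0 at start, 1.0 at end
    steps = round(max_steps - frac * (max_steps - min_steps))
    return max(min_steps, steps)

# Example: step budget for each token of a 64-token generation.
schedule = [annealed_steps(i, 64) for i in range(64)]
print(schedule[0], schedule[-1])  # 50 steps for the first token, 5 for the last
```

In a real model the returned count would be passed to the per-token diffusion sampler (e.g., as its number of denoising timesteps), leaving the rest of the autoregressive loop unchanged.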