DiSA: Diffusion Step Annealing in Autoregressive Image Generation

Abstract

An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because diffusion usually takes 50 to 100 steps to sample a single token. This paper explores how to address this issue effectively. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. Intuitively, if a model has generated part of a dog, the remaining tokens must complete the dog and are thus more constrained. Empirical evidence supports this motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on this finding, we introduce diffusion step annealing (DiSA), a training-free method that gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models and, despite its simplicity, achieves 5-10× faster inference for MAR and Harmon and 1.4-2.5× for FlowAR and xAR, while maintaining generation quality.
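The annealing idea described above can be illustrated with a minimal Python sketch. The abstract only states that the step count falls from a larger value (e.g., 50) to a smaller one (e.g., 5) as generation proceeds; the linear interpolation used here, the function name `annealed_steps`, and the choice of 10 autoregressive steps are all assumptions made for illustration, not the authors' implementation.

```python
def annealed_steps(ar_step: int, num_ar_steps: int,
                   start: int = 50, end: int = 5) -> int:
    """Return the number of diffusion steps for one autoregressive step.

    Hypothetical helper: linearly interpolates from `start` (first
    autoregressive step) to `end` (last autoregressive step). The linear
    shape is an assumption; the abstract only says the count decreases
    gradually.
    """
    if num_ar_steps <= 0:
        raise ValueError("num_ar_steps must be positive")
    if not 0 <= ar_step < num_ar_steps:
        raise ValueError("ar_step must be in [0, num_ar_steps)")
    if num_ar_steps == 1:
        return start
    # Fraction of the autoregressive process already completed, in [0, 1].
    frac = ar_step / (num_ar_steps - 1)
    return round(start + (end - start) * frac)


# Example: 10 autoregressive steps (an arbitrary number chosen for the demo).
schedule = [annealed_steps(i, 10) for i in range(10)]
total_annealed = sum(schedule)          # denoising calls with annealing
total_constant = 50 * len(schedule)     # denoising calls with a fixed 50 steps
print(schedule)
print(total_annealed, total_constant)
```

In a sampling loop, the returned value would replace the fixed diffusion step count passed to the model's sampler at each autoregressive step. Under this assumed schedule, the total number of denoising calls drops from 500 to 275; this is only a count for the toy setting above and does not correspond to the speedups reported in the abstract.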
