D-AR: Diffusion via Autoregressive Models

Abstract

This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm that recasts the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing a tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in pixel space. Thanks to the properties of diffusion, these tokens naturally follow a coarse-to-fine order, which lends itself directly to autoregressive modeling. We therefore apply standard next-token prediction to these tokens without modifying any underlying designs (either causal masks or training/inference strategies), and this sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in a streaming manner. Our pipeline naturally reveals several intriguing properties: for example, it supports consistent previews when only a subset of tokens has been generated, and it enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures for visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR.
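
As a rough illustration of the streaming decode described above, the following toy sketch shows only the control flow: tokens are produced one at a time, and each completed group of tokens triggers one denoising step, yielding a preview. It is not the D-AR implementation. The number of steps, the equal-sized token groups, the codebook size, and the helpers `toy_next_token` and `toy_denoise_step` are all assumptions made for illustration; only the 256-token sequence length comes from the abstract.

```python
import random
from typing import Iterator, List

NUM_TOKENS = 256                  # sequence length stated in the abstract
NUM_STEPS = 8                     # assumption: number of denoising steps
GROUP = NUM_TOKENS // NUM_STEPS   # assumption: equal-sized token groups
VOCAB = 1024                      # assumption: codebook size


def toy_next_token(prefix: List[int], rng: random.Random) -> int:
    """Hypothetical stand-in for a causal LM sampling the next token id."""
    return rng.randrange(VOCAB)


def toy_denoise_step(x: List[float], group: List[int], step: int) -> List[float]:
    """Hypothetical stand-in for one denoising step conditioned on a token group."""
    target = sum(group) / (len(group) * VOCAB)  # toy conditioning signal in [0, 1)
    blend = 1.0 / (NUM_STEPS - step)            # grows to 1.0 at the final step
    return [(1.0 - blend) * v + blend * target for v in x]


def stream_generate(seed: int = 0, pixels: int = 4) -> Iterator[List[float]]:
    """Yield one partially denoised toy 'image' per completed token group."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(pixels)]  # start from Gaussian noise
    tokens: List[int] = []
    for i in range(NUM_TOKENS):
        tokens.append(toy_next_token(tokens, rng))    # plain next-token loop
        if (i + 1) % GROUP == 0:                      # a token group is complete
            step = (i + 1) // GROUP - 1
            x = toy_denoise_step(x, tokens[-GROUP:], step)
            yield x                                   # preview after this step


previews = list(stream_generate())
print(len(previews))  # one preview per denoising step
```

Stopping the generator early corresponds to the preview behaviour mentioned in the abstract: earlier previews are coarser states of the same trajectory that later tokens go on to refine.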
