Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression
June 11, 2025
Authors: Dingcheng Zhen, Qian Qiao, Tan Yu, Kangxi Wu, Ziwei Zhang, Siyuan Liu, Shunshun Yin, Ming Tao
cs.AI
Abstract
We introduce TransDiff, the first image generation model that marries an
autoregressive (AR) Transformer with a diffusion model. In this joint modeling
framework, TransDiff encodes labels and images into high-level semantic
features and employs a diffusion model to estimate the distribution of image
samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms
other image generation models based on standalone AR Transformers or diffusion
models. Specifically, TransDiff achieves a Fréchet Inception Distance (FID)
of 1.61 and an Inception Score (IS) of 293.4, while running inference 2x faster
than state-of-the-art AR Transformer methods and 112x faster than
diffusion-only models. Furthermore,
building on the TransDiff model, we introduce a novel image generation paradigm
called Multi-Reference Autoregression (MRAR), which performs autoregressive
generation by predicting the next image. MRAR enables the model to reference
multiple previously generated images, thereby facilitating the learning of more
diverse representations and improving the quality of generated images in
subsequent iterations. Applying MRAR improves TransDiff's performance,
reducing the FID from 1.61 to 1.42. We expect TransDiff to open
up a new frontier in the field of image generation.
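
The control flow the abstract describes for MRAR can be illustrated with a toy sketch: an encoder turns the label and all previously generated images into a semantic feature, and a decoder produces the next image from that feature. This is not the authors' implementation. The functions `encode_condition`, `diffusion_decode`, and `mrar_generate`, the constants, and the update rule are all hypothetical stand-ins chosen only to make the loop runnable with NumPy.

```python
import numpy as np

FEATURE_DIM = 16           # hypothetical semantic-feature width
IMAGE_SHAPE = (8, 8, 3)    # tiny stand-in for a 256x256 image


def encode_condition(label, references, dim=FEATURE_DIM):
    """Toy stand-in for the AR Transformer: label + all earlier images -> one feature."""
    feature = np.zeros(dim)
    feature[label % dim] = 1.0  # toy label embedding (assumption)
    for image in references:    # every earlier image contributes, not only the latest
        feature += np.resize(image, dim) / len(references)
    return feature


def diffusion_decode(feature, rng, steps=4):
    """Toy stand-in for the diffusion decoder: start from noise, move toward a target."""
    target = np.resize(feature, IMAGE_SHAPE)   # feature tiled to image shape
    image = rng.standard_normal(IMAGE_SHAPE)   # pure noise at the start
    for _ in range(steps):
        image = image + 0.5 * (target - image)  # toy "denoising" update (assumption)
    return image


def mrar_generate(label, num_images, seed=0):
    """Generate images one by one, conditioning each on all previous ones."""
    rng = np.random.default_rng(seed)
    references = []
    for _ in range(num_images):
        feature = encode_condition(label, references)
        references.append(diffusion_decode(feature, rng))
    return references


images = mrar_generate(label=3, num_images=4)
print(len(images), images[-1].shape)  # 4 (8, 8, 3)
```

The point of the sketch is only the loop structure: the reference list grows with each step, so later images are conditioned on more context than earlier ones. In the actual model, the encoder and decoder would be trained networks rather than these fixed arithmetic placeholders.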