
Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression

June 11, 2025
Authors: Dingcheng Zhen, Qian Qiao, Tan Yu, Kangxi Wu, Ziwei Zhang, Siyuan Liu, Shunshun Yin, Ming Tao
cs.AI

Abstract

We introduce TransDiff, the first image generation model that marries Autoregressive (AR) Transformer with diffusion models. In this joint modeling framework, TransDiff encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms other image generation models based on standalone AR Transformer or diffusion models. Specifically, TransDiff achieves a Fréchet Inception Distance (FID) of 1.61 and an Inception Score (IS) of 293.4, with 2x faster inference than state-of-the-art AR Transformer-based methods and 112x faster inference than diffusion-only models. Furthermore, building on the TransDiff model, we introduce a novel image generation paradigm called Multi-Reference Autoregression (MRAR), which performs autoregressive generation by predicting the next image. MRAR enables the model to reference multiple previously generated images, thereby facilitating the learning of more diverse representations and improving the quality of generated images in subsequent iterations. By applying MRAR, the performance of TransDiff is improved, with the FID reduced from 1.61 to 1.42. We expect TransDiff to open up a new frontier in the field of image generation.
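To make the two ideas in the abstract concrete, here is a minimal, hypothetical sketch (not the authors' code) of the joint framework and the MRAR loop: a Transformer stand-in encodes the class label, plus features of previously generated images under MRAR, into semantic condition tokens, and a toy diffusion decoder denoises an image latent conditioned on them. All module names, dimensions, the reference count, and the crude denoising update are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Stand-in for the AR Transformer: maps a label (and optional
    reference-image features) to high-level semantic condition tokens."""
    def __init__(self, num_classes=1000, dim=256, depth=2, latent_dim=64):
        super().__init__()
        self.label_emb = nn.Embedding(num_classes, dim)
        self.ref_proj = nn.Linear(latent_dim, dim)  # embeds prior images' latents
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, labels, ref_feats=None):
        tokens = self.label_emb(labels).unsqueeze(1)        # (B, 1, dim)
        if ref_feats is not None:                           # MRAR: attend over
            tokens = torch.cat([ref_feats, tokens], dim=1)  # earlier images
        return self.encoder(tokens)                         # (B, T, dim)

class DiffusionDecoder(nn.Module):
    """Toy denoiser: predicts noise for an image latent given the timestep
    and the semantic tokens (mean-pooled here for simplicity)."""
    def __init__(self, latent_dim=64, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, x_t, t, cond_tokens):
        cond = cond_tokens.mean(dim=1)                      # (B, cond_dim)
        t = t.float().unsqueeze(-1) / 1000.0                # scalar time feature
        return self.net(torch.cat([x_t, cond, t], dim=-1))  # predicted noise

@torch.no_grad()
def sample(encoder, decoder, labels, num_refs=3, steps=50, latent_dim=64):
    """MRAR-style generation: each iteration predicts the next image while
    conditioning on semantic features of all previously generated images."""
    ref_feats = None
    for _ in range(num_refs):
        cond = encoder(labels, ref_feats)
        x = torch.randn(labels.size(0), latent_dim)         # start from noise
        for step in reversed(range(steps)):                 # crude denoising loop
            t = torch.full((labels.size(0),), step)
            eps = decoder(x, t, cond)
            x = x - eps / steps                             # toy update rule
        feat = encoder.ref_proj(x).unsqueeze(1)             # (B, 1, dim)
        ref_feats = feat if ref_feats is None else torch.cat([ref_feats, feat], 1)
    return x

labels = torch.randint(0, 1000, (2,))
latents = sample(SemanticEncoder(), DiffusionDecoder(), labels)
print(latents.shape)  # torch.Size([2, 64])
```

The key structural point the sketch tries to capture is the division of labor the abstract describes: the Transformer side produces semantic features rather than pixel tokens, the diffusion side models the image distribution given those features, and MRAR grows the conditioning context with each generated image.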