DiffiT: Diffusion Vision Transformers for Image Generation
December 4, 2023
Authors: Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat
cs.AI
Abstract
Diffusion models with their powerful expressivity and high sample quality
have enabled many new applications and use-cases in various domains. For sample
generation, these models rely on a denoising neural network that generates
images by iterative denoising. Yet, the role of the denoising network
architecture is not well studied, with most efforts relying on convolutional
residual U-Nets.
In this paper, we study the effectiveness of vision transformers in
diffusion-based generative learning. Specifically, we propose a new model,
denoted as Diffusion Vision Transformers (DiffiT), which consists of a hybrid
hierarchical architecture with a U-shaped encoder and decoder. We introduce a
novel time-dependent self-attention module that allows attention layers to
adapt their behavior at different stages of the denoising process in an
efficient manner. We also introduce latent DiffiT, which consists of a
transformer model with the proposed self-attention layers, for high-resolution image
generation. Our results show that DiffiT is surprisingly effective in
generating high-fidelity images, and it achieves state-of-the-art (SOTA)
results on a variety of class-conditional and unconditional synthesis
benchmarks. In the latent space, DiffiT achieves a new SOTA FID score of 1.73
on the ImageNet-256 dataset. Repository: https://github.com/NVlabs/DiffiT
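
The time-dependent self-attention module is the abstract's key architectural idea: the attention computation is conditioned on the diffusion time step, so the layer can behave differently at early (noisy) versus late (refinement) stages of denoising. The sketch below is a minimal illustration of one plausible reading, not the official DiffiT implementation: it assumes the time-step embedding is fused into the query/key/value projections via a separate linear layer whose output is summed with the spatial projection. All class, module, and variable names here are hypothetical; see the repository above for the authors' actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimeDependentSelfAttention(nn.Module):
    """Illustrative self-attention whose q/k/v depend on both the spatial
    tokens and a diffusion time-step embedding (hypothetical sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Separate projections for spatial tokens and the time embedding;
        # summing them lets the attention pattern shift across denoising stages.
        self.qkv_spatial = nn.Linear(dim, 3 * dim, bias=False)
        self.qkv_time = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); t_emb: (batch, dim) time-step embedding.
        b, n, d = x.shape
        # Time-dependent q/k/v: spatial projection plus a per-time-step offset.
        qkv = self.qkv_spatial(x) + self.qkv_time(t_emb).unsqueeze(1)
        q, k, v = qkv.chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim) for multi-head attention.
        q, k, v = (
            t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
            for t in (q, k, v)
        )
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))


# Usage: 2 images of 64 tokens with 128 channels, conditioned on a time embedding.
attn = TimeDependentSelfAttention(dim=128)
y = attn(torch.randn(2, 64, 128), torch.randn(2, 128))  # -> (2, 64, 128)
```

Because the time embedding enters the projections additively, the layer adds only one extra linear map per attention block, which is consistent with the abstract's claim that the adaptation is achieved "in an efficient manner."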