DiffiT: Diffusion Vision Transformers for Image Generation
December 4, 2023
Authors: Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat
cs.AI
Abstract
Diffusion models with their powerful expressivity and high sample quality
have enabled many new applications and use-cases in various domains. For sample
generation, these models rely on a denoising neural network that generates
images by iterative denoising. Yet, the role of the denoising network
architecture is not well studied, with most efforts relying on convolutional residual U-Nets.
In this paper, we study the effectiveness of vision transformers in
diffusion-based generative learning. Specifically, we propose a new model,
denoted as Diffusion Vision Transformers (DiffiT), which consists of a hybrid
hierarchical architecture with a U-shaped encoder and decoder. We introduce a
novel time-dependent self-attention module that allows attention layers to
adapt their behavior at different stages of the denoising process in an
efficient manner. We also introduce latent DiffiT, which consists of a transformer
model with the proposed self-attention layers, for high-resolution image
generation. Our results show that DiffiT is surprisingly effective in
generating high-fidelity images, and it achieves state-of-the-art (SOTA)
benchmarks on a variety of class-conditional and unconditional synthesis tasks.
In the latent space, DiffiT achieves a new SOTA FID score of 1.73 on the
ImageNet-256 dataset. Repository: https://github.com/NVlabs/DiffiT
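
Below is a minimal, hypothetical PyTorch sketch of a time-dependent self-attention layer in the spirit described in the abstract: a time-step embedding contributes to the query/key/value projections so the attention pattern can adapt across denoising steps. The names (`TimeDependentSelfAttention`, `time_dim`) and the exact way the time embedding is injected are illustrative assumptions, not the official DiffiT implementation; see the repository for the authors' code.

```python
# Hypothetical sketch of a time-dependent self-attention layer.
# The time-step embedding is mixed into the query/key/value projections,
# so attention behavior can change across denoising steps.
# This illustrates the idea only; it is not the official DiffiT code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeDependentSelfAttention(nn.Module):
    def __init__(self, dim: int, time_dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Separate projections for spatial tokens and the time embedding.
        self.qkv_spatial = nn.Linear(dim, 3 * dim, bias=False)
        self.qkv_time = nn.Linear(time_dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) spatial tokens; t_emb: (B, time_dim) time-step embedding.
        B, N, _ = x.shape
        # Queries/keys/values depend on both the tokens and the time step.
        qkv = self.qkv_spatial(x) + self.qkv_time(t_emb).unsqueeze(1)
        q, k, v = qkv.chunk(3, dim=-1)

        # Reshape to (B, heads, N, head_dim) for multi-head attention.
        def split_heads(z):
            return z.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)
```

In a DiffiT-style denoiser, blocks like this would presumably be stacked inside the U-shaped encoder and decoder, each receiving the same time embedding at every denoising step.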