DiffiT: Diffusion Vision Transformers for Image Generation

Abstract

Diffusion models, with their powerful expressivity and high sample quality, have enabled many new applications and use cases in various domains. For sample generation, these models rely on a denoising neural network that generates images by iterative denoising. Yet the role of the denoising network architecture is not well studied, with most efforts relying on convolutional residual U-Nets. In this paper, we study the effectiveness of vision transformers in diffusion-based generative learning. Specifically, we propose a new model, denoted Diffusion Vision Transformers (DiffiT), which consists of a hybrid hierarchical architecture with a U-shaped encoder and decoder. We introduce a novel time-dependent self-attention module that allows attention layers to adapt their behavior efficiently at different stages of the denoising process. We also introduce latent DiffiT, a transformer model with the proposed self-attention layers, for high-resolution image generation. Our results show that DiffiT is surprisingly effective in generating high-fidelity images, and that it achieves state-of-the-art (SOTA) results on a variety of class-conditional and unconditional synthesis tasks. In the latent space, DiffiT achieves a new SOTA FID score of 1.73 on the ImageNet-256 dataset. Repository: https://github.com/NVlabs/DiffiT
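The abstract does not spell out how the time-dependent self-attention module is formulated. As a rough illustration only, the sketch below assumes one plausible design: queries, keys, and values are each the sum of a projection of the spatial tokens and a projection of a time-step embedding, so that the attention pattern changes with the denoising stage. All names (`time_dependent_self_attention`, `W_qs`, `W_qt`, and so on), the single-head layout, and the random weights are hypothetical and are not taken from the DiffiT repository.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d, rng):
    """Random (d, d) projection matrices; names are hypothetical."""
    names = ["W_qs", "W_qt", "W_ks", "W_kt", "W_vs", "W_vt"]
    return {name: rng.standard_normal((d, d)) / np.sqrt(d) for name in names}


def time_dependent_self_attention(x, t_emb, params):
    """Single-head attention whose q, k, v depend on a time embedding.

    x:     (n_tokens, d) spatial tokens
    t_emb: (d,) embedding of the current denoising step (assumed given)
    """
    d = x.shape[-1]
    # Assumption: each of q, k, v adds a time-dependent term to the usual
    # spatial projection. The (d,) time term broadcasts over all tokens.
    q = x @ params["W_qs"] + t_emb @ params["W_qt"]
    k = x @ params["W_ks"] + t_emb @ params["W_kt"]
    v = x @ params["W_vs"] + t_emb @ params["W_vt"]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (n_tokens, n_tokens)
    return attn @ v  # (n_tokens, d)


rng = np.random.default_rng(0)
n_tokens, d = 16, 8
params = init_params(d, rng)
x = rng.standard_normal((n_tokens, d))

# Two stand-in time embeddings for an early and a late denoising step.
t_early = rng.standard_normal(d)
t_late = rng.standard_normal(d)

out_early = time_dependent_self_attention(x, t_early, params)
out_late = time_dependent_self_attention(x, t_late, params)
```

With the same tokens `x`, the two calls return different outputs because only the time embedding differs, which is the sense in which such a layer can adapt its behavior across denoising stages under these assumptions.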
