DiffiT: Diffusion Vision Transformers for Image Generation

Abstract

Diffusion models, with their powerful expressivity and high sample quality, have enabled many new applications and use cases in various domains. For sample generation, these models rely on a denoising neural network that generates images by iterative denoising. Yet the role of the denoising network architecture is not well studied, with most efforts relying on convolutional residual U-Nets. In this paper, we study the effectiveness of vision transformers in diffusion-based generative learning. Specifically, we propose a new model, denoted Diffusion Vision Transformers (DiffiT), which consists of a hybrid hierarchical architecture with a U-shaped encoder and decoder. We introduce a novel time-dependent self-attention module that allows attention layers to adapt their behavior efficiently at different stages of the denoising process. We also introduce latent DiffiT, which consists of a transformer model with the proposed self-attention layers, for high-resolution image generation. Our results show that DiffiT is surprisingly effective in generating high-fidelity images, and it achieves state-of-the-art (SOTA) results on a variety of class-conditional and unconditional synthesis tasks. In the latent space, DiffiT achieves a new SOTA FID score of 1.73 on the ImageNet-256 dataset. Repository: https://github.com/NVlabs/DiffiT
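The abstract does not spell out how the time-dependent self-attention module works, so the following is only a rough illustration of the general idea, not the authors' implementation. It assumes one simple formulation in which queries, keys, and values are each the sum of a projection of the spatial token and a projection of a time-step embedding. The function names, the weight keys ("qs", "qt", and so on), the single-head layout, and the toy dimensions are all hypothetical.

```python
import math
import random


def matvec(vec, mat):
    """Multiply a row vector (length d_in) by a d_in x d_out matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_dependent_attention(tokens, t_emb, w):
    """Single-head self-attention whose q, k, v depend on a time embedding.

    tokens: list of n spatial token vectors, each of length d
    t_emb:  time-step embedding of length d, shared by all tokens
    w:      dict of d x d matrices; "qs", "ks", "vs" act on the tokens and
            "qt", "kt", "vt" act on the time embedding (hypothetical names)
    """
    d = len(t_emb)

    def project(name):
        # The time contribution is the same for every token, so compute it once.
        t_part = matvec(t_emb, w[name + "t"])
        return [[a + b for a, b in zip(matvec(x, w[name + "s"]), t_part)] for x in tokens]

    q, k, v = project("q"), project("k"), project("v")
    out = []
    for qi in q:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        attn = softmax(scores)
        out.append([sum(p * vj[c] for p, vj in zip(attn, v)) for c in range(d)])
    return out


def random_matrix(rng, d):
    """Small random d x d matrix, used only to make the demo runnable."""
    return [[rng.gauss(0.0, 0.5) for _ in range(d)] for _ in range(d)]


rng = random.Random(0)
d, n = 4, 3
weights = {name: random_matrix(rng, d) for name in ("qs", "ks", "vs", "qt", "kt", "vt")}
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
t_early = [1.0, 0.0, 0.0, 0.0]  # stand-in embeddings for two different time steps
t_late = [0.0, 0.0, 0.0, 1.0]

out_early = time_dependent_attention(tokens, t_early, weights)
out_late = time_dependent_attention(tokens, t_late, weights)
```

With the same spatial tokens, `out_early` and `out_late` differ because the time embedding shifts the queries, keys, and values. That is the sense in which an attention layer of this kind can behave differently at different stages of denoising. If the three time projections are set to zero, the layer reduces to ordinary self-attention.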
