Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

Abstract

We present the Hourglass Diffusion Transformer (HDiT), an image generative model that scales linearly with pixel count, supporting training at high resolution (e.g. 1024 × 1024) directly in pixel space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders, or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet 256^2 and sets a new state of the art for diffusion models on FFHQ-1024^2.
