
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

January 21, 2024
Authors: Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, Enrico Shippole
cs.AI

Abstract

We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high resolution (e.g. 1024×1024) directly in pixel space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet 256^2, and sets a new state-of-the-art for diffusion models on FFHQ-1024^2.
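The abstract does not spell out the architecture, but the name points at the core idea: a U-Net-like hierarchy of Transformer levels, where attention at the fine (high-resolution) levels is restricted to local neighborhoods so that cost grows linearly with pixel count, while full global attention is reserved for the heavily downsampled bottleneck, with skip connections joining the levels. The following is a minimal, hypothetical PyTorch sketch of that hourglass pattern, not the authors' implementation: windowed attention stands in for neighborhood attention, and the class names, the 2×2 token merging/splitting, and the learned skip interpolation weight are illustrative assumptions.

```python
# Hypothetical sketch of the hourglass pattern (not the authors' code).
# Fine levels use attention restricted to local windows, so cost grows
# linearly with pixel count; only the coarse bottleneck uses global attention.
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Self-attention over non-overlapping windows -- a stand-in for the
    local/neighborhood attention assumed at high-resolution levels."""

    def __init__(self, dim, heads=4, window=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x):  # x: (B, H, W, C) grid of tokens
        B, H, W, C = x.shape
        w = self.window
        # Partition the grid into (H/w)*(W/w) windows of w*w tokens each.
        x = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)
        x, _ = self.attn(x, x, x)
        # Undo the window partition.
        x = x.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H, W, C)


class GlobalLevel(nn.Module):
    """Bottleneck level: full global attention on the coarsest token grid."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C)
        B, H, W, C = x.shape
        t = x.reshape(B, H * W, C)
        out, _ = self.attn(t, t, t)
        return x + out.reshape(B, H, W, C)


class HourglassLevel(nn.Module):
    """One hourglass level: local attention, recurse at half resolution via
    2x2 token merging, then blend the upsampled result with a skip connection."""

    def __init__(self, dim, inner):
        super().__init__()
        self.local = WindowAttention(dim)
        self.merge = nn.Linear(4 * dim, dim)   # 2x2 tokens -> 1 token (downsample)
        self.inner = inner                     # next, coarser level
        self.split = nn.Linear(dim, 4 * dim)   # 1 token -> 2x2 tokens (upsample)
        self.skip_w = nn.Parameter(torch.tensor(0.5))  # learned skip interpolation

    def forward(self, x):  # x: (B, H, W, C)
        x = x + self.local(x)
        skip = x
        B, H, W, C = x.shape
        # Downsample: fold each 2x2 neighborhood into one token of width 4C.
        d = x.reshape(B, H // 2, 2, W // 2, 2, C).permute(0, 1, 3, 2, 4, 5)
        d = self.merge(d.reshape(B, H // 2, W // 2, 4 * C))
        d = self.inner(d)
        # Upsample: expand each token back into a 2x2 neighborhood.
        u = self.split(d).reshape(B, H // 2, W // 2, 2, 2, C)
        u = u.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return torch.lerp(skip, u, self.skip_w)


if __name__ == "__main__":
    dim = 64
    # Two fine (local-attention) levels around one global-attention bottleneck.
    model = HourglassLevel(dim, HourglassLevel(dim, GlobalLevel(dim)))
    tokens = torch.randn(1, 64, 64, dim)      # 64x64 token grid
    print(model(tokens).shape)                # torch.Size([1, 64, 64, 64])
```

The point of the sketch is the cost structure rather than the exact blocks: the windowed attention at each fine level is linear in the number of tokens, while the quadratic global attention only ever sees the downsampled bottleneck grid, which is what makes direct pixel-space training at megapixel resolutions tractable.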