
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

January 21, 2024
作者: Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, Enrico Shippole
cs.AI

Abstract

We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high resolution (e.g. 1024×1024) directly in pixel space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders, or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet 256², and sets a new state of the art for diffusion models on FFHQ-1024².