
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

January 21, 2024
Authors: Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, Enrico Shippole
cs.AI

Abstract

We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high resolution (e.g. 1024×1024) directly in pixel space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet 256^2, and sets a new state-of-the-art for diffusion models on FFHQ-1024^2.
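The abstract does not spell out the architecture, but the name points at the core idea: a U-Net-like hierarchy of Transformer levels, where attention at the fine (high-resolution) levels is restricted to local neighborhoods so that cost grows linearly with pixel count, while full global attention is reserved for the heavily downsampled bottleneck, with skip connections joining the levels. The following is a minimal, hypothetical PyTorch sketch of that hourglass pattern, not the authors' implementation: windowed attention stands in for neighborhood attention, and the class names, the 2×2 token merging/splitting, and the learned skip interpolation weight are illustrative assumptions.

```python
# Hypothetical sketch of the hourglass pattern (not the authors' code).
# Fine levels use attention restricted to local windows, so cost grows
# linearly with pixel count; only the coarse bottleneck uses global attention.
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Self-attention over non-overlapping windows -- a stand-in for the
    local/neighborhood attention assumed at high-resolution levels."""

    def __init__(self, dim, heads=4, window=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x):  # x: (B, H, W, C) grid of tokens
        B, H, W, C = x.shape
        w = self.window
        # Partition the grid into (H/w)*(W/w) windows of w*w tokens each.
        x = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)
        x, _ = self.attn(x, x, x)
        # Undo the window partition.
        x = x.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H, W, C)


class GlobalLevel(nn.Module):
    """Bottleneck level: full global attention on the coarsest token grid."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C)
        B, H, W, C = x.shape
        t = x.reshape(B, H * W, C)
        out, _ = self.attn(t, t, t)
        return x + out.reshape(B, H, W, C)


class HourglassLevel(nn.Module):
    """One hourglass level: local attention, recurse at half resolution via
    2x2 token merging, then blend the upsampled result with a skip connection."""

    def __init__(self, dim, inner):
        super().__init__()
        self.local = WindowAttention(dim)
        self.merge = nn.Linear(4 * dim, dim)   # 2x2 tokens -> 1 token (downsample)
        self.inner = inner                     # next, coarser level
        self.split = nn.Linear(dim, 4 * dim)   # 1 token -> 2x2 tokens (upsample)
        self.skip_w = nn.Parameter(torch.tensor(0.5))  # learned skip interpolation

    def forward(self, x):  # x: (B, H, W, C)
        x = x + self.local(x)
        skip = x
        B, H, W, C = x.shape
        # Downsample: fold each 2x2 neighborhood into one token of width 4C.
        d = x.reshape(B, H // 2, 2, W // 2, 2, C).permute(0, 1, 3, 2, 4, 5)
        d = self.merge(d.reshape(B, H // 2, W // 2, 4 * C))
        d = self.inner(d)
        # Upsample: expand each token back into a 2x2 neighborhood.
        u = self.split(d).reshape(B, H // 2, W // 2, 2, 2, C)
        u = u.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return torch.lerp(skip, u, self.skip_w)


if __name__ == "__main__":
    dim = 64
    # Two fine (local-attention) levels around one global-attention bottleneck.
    model = HourglassLevel(dim, HourglassLevel(dim, GlobalLevel(dim)))
    tokens = torch.randn(1, 64, 64, dim)      # 64x64 token grid
    print(model(tokens).shape)                # torch.Size([1, 64, 64, 64])
```

The point of the sketch is the cost structure rather than the exact blocks: the windowed attention at each fine level is linear in the number of tokens, while the quadratic global attention only ever sees the downsampled bottleneck grid, which is what makes direct pixel-space training at megapixel resolutions tractable.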