
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

January 21, 2024
作者: Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, Enrico Shippole
cs.AI

Abstract

We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high resolution (e.g. 1024×1024) directly in pixel space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders, or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet 256², and sets a new state of the art for diffusion models on FFHQ-1024².