Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

Abstract

We present the Hourglass Diffusion Transformer (HDiT), an image generative model whose computational cost scales linearly with pixel count, supporting training at high resolution (e.g. 1024×1024) directly in pixel space. Building on the Transformer architecture, which is known to scale to billions of parameters, HDiT bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders, or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet 256^2 and sets a new state of the art for diffusion models on FFHQ-1024^2.
