Diffusion Transformers with Representation Autoencoders
October 13, 2025
Authors: Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie
cs.AI
Abstract
Latent generative modeling, where a pretrained autoencoder maps pixels into a
latent space for the diffusion process, has become the standard strategy for
Diffusion Transformers (DiT); however, the autoencoder component has barely
evolved. Most DiTs continue to rely on the original VAE encoder, which
introduces several limitations: outdated backbones that compromise
architectural simplicity, low-dimensional latent spaces that restrict
information capacity, and weak representations that result from purely
reconstruction-based training and ultimately limit generative quality. In this
work, we explore replacing the VAE with pretrained representation encoders
(e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term
Representation Autoencoders (RAEs). These models provide both high-quality
reconstructions and semantically rich latent spaces, while allowing for a
scalable transformer-based architecture. Since these latent spaces are
typically high-dimensional, a key challenge is enabling diffusion transformers
to operate effectively within them. We analyze the sources of this difficulty,
propose theoretically motivated solutions, and validate them empirically. Our
approach achieves faster convergence without auxiliary representation alignment
losses. Using a DiT variant equipped with a lightweight, wide DDT head, we
achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no
guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers
clear advantages and should be the new default for diffusion transformer
training.
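
To make the RAE construction concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a frozen pretrained representation encoder paired with a decoder trained purely for reconstruction. The module names, sizes, and plain-MSE objective are illustrative assumptions, not the paper's implementation; in practice the encoder would be a real DINO, SigLIP, or MAE checkpoint, and reconstruction training typically adds perceptual or adversarial terms.

```python
# Hypothetical sketch of a Representation Autoencoder (RAE): a frozen
# representation encoder (stand-in for DINO/SigLIP/MAE) plus a trained
# ViT-style pixel decoder. All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained encoder. A real setup would load weights,
    e.g. torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14'), and
    freeze them; a random patch projection keeps this sketch self-contained."""
    def __init__(self, patch=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        for p in self.parameters():
            p.requires_grad = False  # the encoder is never trained

    def forward(self, x):  # x: (B, 3, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim) tokens

class PixelDecoder(nn.Module):
    """Trained decoder: transformer blocks over the latent tokens,
    followed by an unpatchify step back to pixels."""
    def __init__(self, patch=16, dim=768, depth=4):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.to_pix = nn.Linear(dim, 3 * patch * patch)
        self.patch = patch

    def forward(self, z, hw):  # z: (B, N, dim)
        h, w = hw
        p = self.patch
        gh, gw = h // p, w // p
        x = self.to_pix(self.blocks(z))                  # (B, N, 3*p*p)
        x = x.reshape(-1, gh, gw, p, p, 3)
        x = x.permute(0, 5, 1, 3, 2, 4)                  # (B, 3, gh, p, gw, p)
        return x.reshape(-1, 3, h, w)                    # (B, 3, H, W)

# Only the decoder receives gradients; plain MSE is used here for brevity.
enc, dec = FrozenEncoder(), PixelDecoder()
opt = torch.optim.AdamW(dec.parameters(), lr=1e-4)
img = torch.randn(2, 3, 256, 256)                        # dummy batch
with torch.no_grad():
    z = enc(img)                                         # semantically rich latents
loss = F.mse_loss(dec(z, (256, 256)), img)
opt.zero_grad(); loss.backward(); opt.step()
```

Note the contrast with a VAE: the latent tokens here are high-dimensional (768 channels per token in this sketch rather than the usual 4 to 16), which is exactly why the abstract highlights making diffusion transformers operate effectively in such spaces as the key challenge.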