

Diffusion Transformers with Representation Autoencoders

October 13, 2025
Authors: Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie
cs.AI

Abstract

Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.
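
Below is a minimal sketch of the Representation Autoencoder (RAE) idea described in the abstract, written in PyTorch. The `ToyEncoder` is a stand-in for a frozen pretrained representation model (DINO, SigLIP, or MAE); only the decoder is trained, with a plain reconstruction loss. All names, shapes, and the linear decoder are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Sketch of an RAE: frozen representation encoder + trained pixel decoder.
# Assumed shapes: images (B, 3, H, W); patch tokens (B, N, latent_dim).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for a frozen pretrained ViT encoder returning patch tokens."""

    def __init__(self, dim: int = 768, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)


class RAE(nn.Module):
    def __init__(self, encoder: nn.Module, latent_dim: int = 768, patch: int = 16):
        super().__init__()
        self.encoder = encoder.eval()  # frozen representation encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        # Trained decoder: maps each high-dimensional token back to a pixel
        # patch (the paper trains a decoder network; a linear map suffices
        # here to show the interface).
        self.decoder = nn.Linear(latent_dim, 3 * patch * patch)
        self.patch = patch

    @torch.no_grad()
    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)  # (B, N, latent_dim)

    def decode(self, z: torch.Tensor, size: tuple[int, int]) -> torch.Tensor:
        B, N, _ = z.shape
        h, w = size[0] // self.patch, size[1] // self.patch
        patches = self.decoder(z).view(B, h, w, 3, self.patch, self.patch)
        return patches.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, *size)


# Decoder training step: reconstruction loss updates only the decoder.
# A diffusion transformer would then be trained entirely in the
# semantically rich latent space z.
rae = RAE(ToyEncoder())
x = torch.randn(2, 3, 256, 256)
z = rae.encode(x)  # (2, 256, 768): high-dimensional patch tokens
loss = F.mse_loss(rae.decode(z, (256, 256)), x)
loss.backward()
```

The key design choice this sketch mirrors is that the encoder is never updated, so the latent space keeps the semantic structure of the pretrained representation; the decoder alone is responsible for recovering pixels, and the diffusion model operates on the frozen, high-dimensional tokens.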