Diffusion Transformers with Representation Autoencoders
October 13, 2025
Authors: Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie
cs.AI
Abstract
Latent generative modeling, where a pretrained autoencoder maps pixels into a
latent space for the diffusion process, has become the standard strategy for
Diffusion Transformers (DiT); however, the autoencoder component has barely
evolved. Most DiTs continue to rely on the original VAE encoder, which
introduces several limitations: outdated backbones that compromise
architectural simplicity, low-dimensional latent spaces that restrict
information capacity, and weak representations that result from purely
reconstruction-based training and ultimately limit generative quality. In this
work, we explore replacing the VAE with pretrained representation encoders
(e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term
Representation Autoencoders (RAEs). These models provide both high-quality
reconstructions and semantically rich latent spaces, while allowing for a
scalable transformer-based architecture. Since these latent spaces are
typically high-dimensional, a key challenge is enabling diffusion transformers
to operate effectively within them. We analyze the sources of this difficulty,
propose theoretically motivated solutions, and validate them empirically. Our
approach achieves faster convergence without auxiliary representation alignment
losses. Using a DiT variant equipped with a lightweight, wide DDT head, we
achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no
guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAEs offer
clear advantages and should be the new default for diffusion transformer
training.
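To make the RAE recipe concrete, below is a minimal PyTorch sketch of the autoencoder half: a frozen stand-in for a pretrained representation encoder (DINO, SigLIP, or MAE in the paper) paired with a trainable transformer decoder optimized with a reconstruction loss. All class names, dimensions, and the toy two-layer encoder are illustrative assumptions, not the authors' implementation; in practice the frozen encoder would be an actual pretrained backbone, and the diffusion transformer would then be trained on the resulting high-dimensional latent tokens.

```python
# Minimal sketch of a Representation Autoencoder (RAE): a frozen,
# pretrained-style representation encoder plus a trainable pixel decoder.
# Everything here (names, depths, dims) is an illustrative assumption.

import torch
import torch.nn as nn


class FrozenPatchEncoder(nn.Module):
    """Stand-in for a pretrained ViT encoder (e.g. DINO); weights frozen."""

    def __init__(self, patch=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.parameters():   # frozen: only the decoder is trained
            p.requires_grad = False

    def forward(self, x):                                  # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        return self.body(tokens)     # high-dimensional semantic latents


class PixelDecoder(nn.Module):
    """Trainable ViT-style decoder mapping latent tokens back to pixels."""

    def __init__(self, patch=16, dim=768):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=4)
        self.to_pixels = nn.Linear(dim, 3 * patch * patch)
        self.patch = patch

    def forward(self, z, hw):                              # z: (B, N, dim)
        h, w = hw
        patches = self.to_pixels(self.body(z))             # (B, N, 3*p*p)
        # Reassemble non-overlapping patches into an image.
        return nn.functional.fold(
            patches.transpose(1, 2), output_size=(h, w),
            kernel_size=self.patch, stride=self.patch)


encoder, decoder = FrozenPatchEncoder(), PixelDecoder()
encoder.eval()                      # frozen encoder: disable dropout too
img = torch.randn(2, 3, 256, 256)
z = encoder(img)                    # (2, 256, 768) latent tokens
recon = decoder(z, (256, 256))      # (2, 3, 256, 256)
loss = nn.functional.mse_loss(recon, img)   # reconstruction objective
loss.backward()                     # gradients reach the decoder only
```

A diffusion transformer would then be trained directly on z; because these tokens are far higher-dimensional than typical VAE latents, the paper's contribution is making diffusion work well in that space (e.g., via a lightweight, wide DDT head), which this sketch does not cover.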