Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

January 22, 2026
Authors: Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie
cs.AI

Abstract

Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
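To make the "dimension-dependent noise scheduling" mentioned in the abstract concrete, the sketch below shows one common form of flow-matching timestep shifting, in which the shift factor grows with the dimensionality of the latent space (SD3-style resolution-dependent shifting adapted to latent width). The function name, the `base_dim` reference value, and the example dimensions are illustrative assumptions, not values taken from the paper:

```python
import math

def shift_timestep(t: float, latent_dim: int, base_dim: int = 64) -> float:
    """Dimension-dependent timestep shift (illustrative sketch, not the paper's exact recipe).

    Higher-dimensional latents (e.g. RAE / SigLIP-2 features) retain more signal
    at a given nominal noise level, so the schedule is shifted toward the noisy
    end as latent_dim grows, using the SD3-style mapping
        t' = alpha * t / (1 + (alpha - 1) * t),  alpha = sqrt(latent_dim / base_dim).
    `base_dim` is an assumed reference dimensionality (e.g. a narrow VAE latent).
    """
    alpha = math.sqrt(latent_dim / base_dim)
    return alpha * t / (1.0 + (alpha - 1.0) * t)

# Example: the same nominal timestep maps to a noisier point
# for a wide representation latent than for a narrow VAE-style latent.
for dim in (64, 768):
    print(dim, round(shift_timestep(0.5, dim), 3))
```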