ChatPaper.ai


Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

January 22, 2026
作者: Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie
cs.AI

Abstract

Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
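To make the setup described in the abstract concrete, the following is a minimal sketch of the three pieces it names: a frozen representation encoder (such as SigLIP-2) that defines the latent space, a trainable decoder that reconstructs pixels from those latents, and a diffusion transformer trained in the latent space with a flow-matching objective whose timestep shift depends on latent dimensionality. All module names, shapes, the toy transformer, and the shift formula are illustrative assumptions, not details taken from the paper.

```python
# Sketch of an RAE-style T2I training setup (assumed structure, not the authors' code).
import torch
import torch.nn as nn

class FrozenRepresentationEncoder(nn.Module):
    """Stand-in for a frozen pretrained representation encoder (e.g. SigLIP-2)."""
    def __init__(self, latent_dim: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(3, latent_dim, kernel_size=16, stride=16)  # placeholder patchifier
        for p in self.parameters():
            p.requires_grad_(False)  # the encoder stays frozen throughout training

    @torch.no_grad()
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        z = self.proj(images)                    # (B, D, H/16, W/16)
        return z.flatten(2).transpose(1, 2)      # (B, N, D) token latents

class RAEDecoder(nn.Module):
    """Trainable decoder mapping encoder latents back to pixels (sketch only)."""
    def __init__(self, latent_dim: int = 768, image_size: int = 256):
        super().__init__()
        self.image_size = image_size
        self.to_pixels = nn.Linear(latent_dim, 3 * 16 * 16)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        b, n, _ = latents.shape
        grid = self.image_size // 16
        x = self.to_pixels(latents).view(b, grid, grid, 3, 16, 16)
        return x.permute(0, 3, 1, 4, 2, 5).reshape(b, 3, self.image_size, self.image_size)

class TinyDiT(nn.Module):
    """Toy stand-in for the diffusion transformer (real models span 0.5B-9.8B params)."""
    def __init__(self, latent_dim: int = 768):
        super().__init__()
        self.time_embed = nn.Linear(1, latent_dim)
        self.text_proj = nn.Linear(latent_dim, latent_dim)
        self.block = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, x, t, text_emb):
        cond = self.time_embed(t.view(-1, 1)) + self.text_proj(text_emb)
        return self.out(self.block(x + cond.unsqueeze(1)))

def flow_matching_loss(dit: nn.Module, latents: torch.Tensor,
                       text_emb: torch.Tensor, dim_shift: float) -> torch.Tensor:
    """Rectified-flow style loss with a hypothetical dimension-dependent timestep shift.

    `dim_shift` echoes the abstract's point that dimension-dependent noise
    scheduling remains critical; the exact schedule is defined in the paper.
    """
    b = latents.size(0)
    t = torch.rand(b, device=latents.device)
    t = dim_shift * t / (1 + (dim_shift - 1) * t)   # push more probability mass toward high noise
    t_ = t.view(b, 1, 1)
    noise = torch.randn_like(latents)
    noisy = (1 - t_) * latents + t_ * noise          # linear interpolation path
    target = noise - latents                          # velocity target
    pred = dit(noisy, t, text_emb)
    return (pred - target).pow(2).mean()

if __name__ == "__main__":
    enc, dec, dit = FrozenRepresentationEncoder(), RAEDecoder(), TinyDiT()
    images = torch.randn(2, 3, 256, 256)
    text_emb = torch.randn(2, 768)                   # assumed pooled text embedding
    with torch.no_grad():
        latents = enc(images)                        # frozen encoder defines the latent space
    recon = dec(latents)                             # decoder is trained to reconstruct pixels
    loss = flow_matching_loss(dit, latents, text_emb, dim_shift=3.0)
    print(recon.shape, loss.item())
```

The key design point the sketch reflects is that, unlike a VAE pipeline, only the decoder and the diffusion transformer are trained; the latent space itself comes from the frozen representation encoder, which is what lets visual understanding and generation share a representation space.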