
SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

December 12, 2025
Authors: Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, Pengfei Wan, Kun Gai, Jie Zhou, Jiwen Lu
cs.AI

Abstract

Visual generation grounded in Visual Foundation Model (VFM) representations offers a promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale up the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I, which performs high-quality text-to-image synthesis directly in the VFM feature domain. Using a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. These results validate the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and the generation model, together with their training, inference, and evaluation pipelines and pre-trained weights, to facilitate further research in representation-driven visual generation.
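
To make the core idea concrete, the following is a minimal, hypothetical sketch of diffusion-style sampling carried out directly in a VFM feature space instead of a VAE latent space. It is not the released SVG-T2I implementation: the module names, dimensions, and the flow-matching-style Euler sampler are illustrative assumptions, and the tiny MLP stands in for the actual text-conditioned diffusion transformer.

import torch
import torch.nn as nn

class FeatureDenoiser(nn.Module):
    # Toy stand-in for a text-conditioned diffusion transformer that
    # predicts a velocity field over VFM feature tokens (hypothetical).
    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + text_dim + 1, 4 * feat_dim),
            nn.SiLU(),
            nn.Linear(4 * feat_dim, feat_dim),
        )

    def forward(self, z_t, text_emb, t):
        # z_t: (B, N, feat_dim) noisy VFM tokens; text_emb: (B, text_dim); t: (B,)
        B, N, _ = z_t.shape
        cond = text_emb.unsqueeze(1).expand(B, N, -1)
        tt = t.view(B, 1, 1).expand(B, N, 1)
        return self.net(torch.cat([z_t, cond, tt], dim=-1))

@torch.no_grad()
def sample_features(model, text_emb, num_tokens, feat_dim, steps=50):
    # Euler integration of the learned velocity field from t=1 (noise)
    # to t=0 (data), carried out entirely in the VFM feature space.
    B = text_emb.shape[0]
    z = torch.randn(B, num_tokens, feat_dim)
    for i in range(steps):
        t = torch.full((B,), 1.0 - i / steps)
        v = model(z, text_emb, t)
        z = z - v / steps
    return z  # would be mapped to pixels by a decoder trained on VFM features

model = FeatureDenoiser(feat_dim=768, text_dim=512)   # 768 ~ a ViT-B feature width
text_emb = torch.randn(2, 512)                        # placeholder text embeddings
feats = sample_features(model, text_emb, num_tokens=256, feat_dim=768)
print(feats.shape)  # torch.Size([2, 256, 768])

The key design point the sketch illustrates is that the sampler never touches a VAE latent: generation runs over frozen VFM features, and only a separate decoder (trained on those features) renders pixels.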