SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
December 12, 2025
Authors: Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, Pengfei Wan, Kun Gai, Jie Zhou, Jiwen Lu
cs.AI
Abstract
Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, and evaluation pipelines and pre-trained weights, to facilitate further research in representation-driven visual generation.
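The abstract's core idea is to run latent diffusion directly on frozen VFM features instead of VAE latents. The sketch below is a minimal illustration of that setup, not the authors' implementation: `vfm_encoder`, `text_encoder`, and `denoiser` are placeholder modules, and the rectified-flow (velocity prediction) objective is an assumed, commonly used choice.

```python
# Illustrative sketch (assumptions noted above): one training step of a
# text-to-image latent diffusion model whose "latents" are frozen VFM
# feature tokens rather than VAE latents.
import torch
import torch.nn.functional as F

def t2i_diffusion_step(vfm_encoder, text_encoder, denoiser, optimizer,
                       images, captions):
    # 1) Encode images with a frozen visual foundation model (no VAE).
    with torch.no_grad():
        z0 = vfm_encoder(images)          # (B, N_tokens, D) feature tokens
        cond = text_encoder(captions)     # (B, N_text, D_text) text embeddings

    # 2) Sample a timestep and interpolate toward Gaussian noise
    #    (rectified-flow parameterization, assumed for illustration).
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device).view(b, 1, 1)
    noise = torch.randn_like(z0)
    zt = (1.0 - t) * z0 + t * noise       # noisy VFM features

    # 3) Predict the velocity (noise - data) conditioned on the text.
    v_pred = denoiser(zt, t.view(b), cond)
    loss = F.mse_loss(v_pred, noise - z0)

    # 4) Update only the denoiser; the VFM encoder stays frozen.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At sampling time the same denoiser would be integrated from pure noise back to a VFM feature map, which a separately trained feature decoder (the released autoencoder's decoder half) would then map to pixels.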