
Visual Generation Tuning

November 28, 2025
Authors: Jiahao Guo, Sinan Du, Jingfeng Yao, Wenyu Liu, Bo Li, Haoxiang Cao, Kun Gai, Chun Yuan, Kai Wu, Xinggang Wang
cs.AI

Abstract

Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the latent visual generation capabilities within any vision language model. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly reduce the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we forgo the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE by aligning the semantic encoders of pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art results among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, VGT shows significant scaling promise and can endow any VLM trained for multimodal understanding with visual generation capabilities, paving a new avenue toward next-generation unified multimodal foundation models. Models and code are available at https://github.com/hustvl/VGT.
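The core idea described above, aligning a pretrained VLM's semantic encoder with the latent space of a pixel decoder instead of training a pixel-level VAE, can be illustrated with a minimal sketch. The module names, dimensions, and loss weighting below are illustrative assumptions, not the released VGT implementation.

```python
# Minimal sketch (assumed structure, not the authors' code) of the VGT-AE
# alignment idea: a frozen semantic encoder from a pretrained VLM produces
# visual tokens, a lightweight projector maps them into the continuous latent
# space of a pixel decoder, and the model is trained with an alignment loss
# plus a pixel reconstruction loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VGTAESketch(nn.Module):
    def __init__(self, vlm_encoder: nn.Module, pixel_decoder: nn.Module,
                 enc_dim: int = 1024, latent_dim: int = 16):
        super().__init__()
        self.vlm_encoder = vlm_encoder        # frozen semantic encoder from a pretrained VLM
        self.pixel_decoder = pixel_decoder    # decoder that reconstructs pixels from continuous latents
        self.projector = nn.Linear(enc_dim, latent_dim)  # maps encoder features to decoder latents
        for p in self.vlm_encoder.parameters():
            p.requires_grad_(False)           # keep the understanding-oriented encoder fixed

    def forward(self, images: torch.Tensor, target_latents: torch.Tensor):
        # Semantic patch tokens from the VLM encoder, e.g. shape (B, N, enc_dim).
        feats = self.vlm_encoder(images)
        # Project semantic features into the decoder's continuous latent space.
        latents = self.projector(feats)
        # Alignment loss: pull projected features toward the decoder's latent targets.
        align_loss = F.mse_loss(latents, target_latents)
        # Reconstruction loss: decode the aligned latents back to pixels.
        recon = self.pixel_decoder(latents)
        recon_loss = F.mse_loss(recon, images)
        return recon, align_loss + recon_loss
```

Under this reading, visual generation tuning then amounts to autoregressively predicting these continuous latents with the VLM backbone, so the expensive modality alignment learned during understanding pretraining is reused rather than relearned.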