

Visual Generation Tuning

November 28, 2025
Authors: Jiahao Guo, Sinan Du, Jingfeng Yao, Wenyu Liu, Bo Li, Haoxiang Cao, Kun Gai, Chun Yuan, Kai Wu, Xinggang Wang
cs.AI

Abstract

Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose Visual Generation Tuning (VGT), a novel paradigm designed to elicit the latent visual generation capabilities of any vision language model. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly reduce the alignment cost and accelerate the convergence of autoregressive modeling in continuous space (20x speedup). Specifically, we discard the entangled pixel-level VAEs designed for diffusion transformers and instead formulate VGT-AE by aligning the semantic encoders of pretrained VLMs with the latent representations of pixel decoders. In image reconstruction, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation, we achieve state-of-the-art results among autoregressive models: 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, VGT shows strong scaling promise and can endow any VLM trained for multimodal understanding with visual generation capabilities, paving a new avenue toward next-generation unified multimodal foundation models. Models and code are available at https://github.com/hustvl/VGT.
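To make the VGT-AE formulation in the abstract concrete, the following is a minimal, self-contained sketch of the idea in PyTorch: a frozen patch encoder stands in for a pretrained VLM vision tower, a trainable projector maps its semantic features into a compact continuous latent space, and a toy linear pixel decoder reconstructs the image from those latents. All module names, dimensions, and the simple MSE objective are illustrative assumptions, not the authors' released implementation (see the GitHub repository above for that).

```python
# Minimal sketch of the VGT-AE idea. Assumptions: PyTorch; a frozen conv
# patch embedder stands in for a pretrained VLM vision encoder; a single
# linear layer stands in for the pixel decoder. Names and shapes are
# hypothetical and chosen only to illustrate the alignment of semantic
# features with a pixel-decodable latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH = 16          # 16x16 pixel patches -> one "semantic token" each
ENC_DIM = 1024      # hypothetical VLM feature width
LATENT_DIM = 32     # compact continuous latent fed to the AR generator

class VGTAESketch(nn.Module):
    def __init__(self):
        super().__init__()
        # stand-in for a frozen VLM vision encoder (kept fixed during tuning)
        self.semantic_encoder = nn.Conv2d(3, ENC_DIM, kernel_size=PATCH, stride=PATCH)
        for p in self.semantic_encoder.parameters():
            p.requires_grad_(False)
        # trainable projector aligning semantic features with the decoder latents
        self.proj = nn.Linear(ENC_DIM, LATENT_DIM)
        # toy pixel decoder mapping each latent token back to a 16x16x3 patch
        self.pixel_decoder = nn.Linear(LATENT_DIM, PATCH * PATCH * 3)

    def forward(self, images):                        # images: (B, 3, H, W)
        feats = self.semantic_encoder(images)         # (B, ENC_DIM, H/16, W/16)
        B, _, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)     # (B, N, ENC_DIM)
        latents = self.proj(tokens)                   # (B, N, LATENT_DIM)
        patches = self.pixel_decoder(latents)         # (B, N, 16*16*3)
        recon = (patches.view(B, h, w, PATCH, PATCH, 3)
                        .permute(0, 5, 1, 3, 2, 4)
                        .reshape(B, 3, h * PATCH, w * PATCH))
        loss = F.mse_loss(recon, images)              # reconstruction objective
        return recon, latents, loss

if __name__ == "__main__":
    model = VGTAESketch()
    x = torch.randn(2, 3, 256, 256)
    recon, latents, loss = model(x)
    print(recon.shape, latents.shape, loss.item())    # (2,3,256,256) (2,256,32)
```

In this toy setup the continuous latents (here 256 tokens of 32 dimensions per 256x256 image) are what an autoregressive generator would model, while the pixel decoder guarantees they remain decodable back to images; the actual compression ratio, decoder architecture, and training losses in the paper differ from this illustration.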