DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

December 17, 2025
Authors: Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang
cs.AI

Abstract

In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, limited by the capability of the underlying diffusion language models, diffusion vision language models (dVLMs) still lag significantly behind mainstream models. This raises a simple yet fundamental question: can dVLMs be built on top of existing, powerful AR models? In response, we propose DiffusionVL, a family of dVLMs that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models to the diffusion paradigm, which yields two key observations: (1) the paradigm shift from AR-based multimodal models to diffusion is remarkably effective; (2) direct conversion of an AR language model into a dVLM is also feasible, achieving performance competitive with LLaVA-style visual instruction tuning. Furthermore, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, yielding a significant inference speedup. Extensive experiments show that, despite being trained on less than 5% of the data required by prior methods, DiffusionVL achieves comprehensive performance improvements, including a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cognition) benchmark, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
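The block-decoding idea described in the abstract can be pictured with a minimal sketch. The snippet below is not the released DiffusionVL implementation; it uses a toy random denoiser and hypothetical names (`toy_denoiser`, `MASK_ID`, `BLOCK_LEN`, `DENOISE_STEPS`) purely to illustrate, under those assumptions, how a masked-diffusion decoder could fill in one fixed-length block over several denoising steps while the already-generated prefix stays frozen (so its KV cache can be reused), and how chaining blocks yields arbitrary-length generation.

```python
import torch

VOCAB = 32            # toy vocabulary size
MASK_ID = VOCAB       # hypothetical mask-token id (outside the toy vocabulary)
BLOCK_LEN = 8         # tokens decoded per block
DENOISE_STEPS = 4     # denoising iterations per block


def toy_denoiser(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the dVLM backbone: returns logits over the vocabulary
    for every position. A real model would additionally consume cached
    key/value states for the frozen prefix instead of recomputing them."""
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB)


def decode_block(prefix: torch.Tensor) -> torch.Tensor:
    """Iteratively unmask one fixed-length block; the prefix never changes,
    which is what makes prefix KV cache reuse possible."""
    block = torch.full((1, BLOCK_LEN), MASK_ID, dtype=torch.long)
    for step in range(DENOISE_STEPS):
        logits = toy_denoiser(torch.cat([prefix, block], dim=1))
        probs = logits[:, prefix.shape[1]:, :].softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                # per-position confidence
        masked = block.eq(MASK_ID)
        k = max(1, int(masked.sum()) // (DENOISE_STEPS - step))
        conf = conf.masked_fill(~masked, -1.0)        # only commit masked slots
        idx = conf.topk(k, dim=-1).indices
        block.scatter_(1, idx, pred.gather(1, idx))   # commit most confident tokens
    return block


def generate(prompt: torch.Tensor, num_blocks: int) -> torch.Tensor:
    """Arbitrary-length generation: append each finished block to the prefix."""
    seq = prompt
    for _ in range(num_blocks):
        seq = torch.cat([seq, decode_block(seq)], dim=1)
    return seq


out = generate(torch.randint(0, VOCAB, (1, 4)), num_blocks=3)
print(out.shape)  # torch.Size([1, 28])
```

The confidence-based commit schedule shown here is just one common choice for masked-diffusion decoding; the key point the sketch conveys is that only the current block is re-denoised at each step, so the prefix's key/value states need not be recomputed.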