Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
December 27, 2025
Authors: Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong
cs.AI
Abstract
While autoregressive (AR) Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. On various benchmarks, Dream-VL is comparable to top-tier AR-based VLMs trained on open data, while exhibiting superior potential on visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continued pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of the diffusion backbone provides a superior foundation for VLA tasks: it is inherently suited to action chunking and parallel generation, which leads to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance, with a 97.2% average success rate on LIBERO, a 71.4% overall average on SimplerEnv-Bridge, and a 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as π_0 and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.
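The claim that a bidirectional diffusion backbone is inherently suited to action chunking and parallel generation can be pictured with a short sketch: rather than emitting action tokens left to right, a masked-diffusion decoder initializes the whole action chunk as mask tokens and fills positions in over a few confidence-ordered refinement passes. The snippet below is a minimal illustration only, assuming a hypothetical `model(obs_tokens, actions)` interface and discretized action tokens; it is not the released Dream-VLA implementation.

```python
# Minimal sketch of parallel action-chunk decoding with a bidirectional
# (masked-diffusion) backbone. All H action positions start as [MASK] and are
# refined jointly, instead of being generated one token at a time.
import torch

def parallel_decode_action_chunk(model, obs_tokens, horizon=8, vocab_size=256,
                                 mask_id=0, num_steps=4):
    """Iteratively unmask an action chunk with a bidirectional decoder.

    `model(obs_tokens, actions)` is assumed to return logits of shape
    [horizon, vocab_size] for the action positions; this interface is a
    hypothetical stand-in for an actual dVLA forward pass.
    """
    actions = torch.full((horizon,), mask_id, dtype=torch.long)  # fully masked chunk
    still_masked = torch.ones(horizon, dtype=torch.bool)

    for step in range(num_steps):
        logits = model(obs_tokens, actions)          # [horizon, vocab_size]
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)                   # per-position confidence

        # Unmask the most confident positions this step (confidence-based schedule).
        target = max(1, int(horizon * (step + 1) / num_steps))
        k = target - int((~still_masked).sum())
        if k > 0:
            conf_masked = conf.masked_fill(~still_masked, -1.0)
            idx = conf_masked.topk(k).indices
            actions[idx] = pred[idx]
            still_masked[idx] = False

    return actions  # H discrete action tokens produced in a few parallel passes

if __name__ == "__main__":
    # Toy stand-in model: random logits, just to exercise the decoding loop.
    dummy_model = lambda obs, acts: torch.randn(acts.shape[0], 256)
    obs = torch.zeros(16, dtype=torch.long)  # placeholder observation tokens
    print(parallel_decode_action_chunk(dummy_model, obs))
```

Because every pass scores all positions in the chunk jointly, decoding cost in this sketch is governed by the number of refinement steps rather than by the chunk length.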