Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
December 27, 2025
Authors: Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong
cs.AI
Abstract
While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Across a wide range of benchmarks, Dream-VL is comparable to top-tier autoregressive VLMs trained on open data, and it exhibits superior potential on visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continued pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of the diffusion backbone provides a superior foundation for VLA tasks: it is inherently suited to action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves a 97.2% average success rate on LIBERO, a 71.4% overall average on SimplerEnv-Bridge, and a 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as π_0 and GR00T-N1. We also validate that dVLMs surpass autoregressive baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.
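To make the claim that a bidirectional diffusion backbone is "inherently suited for action chunking and parallel generation" more concrete, the sketch below illustrates one common masked-diffusion decoding pattern: all tokens of an action chunk are filled in jointly over a few denoising steps, rather than one token at a time as in autoregressive decoding. This is a minimal illustrative sketch, not the released Dream-VLA implementation; the `MASK_ID`, the `model(input_ids)` interface, the chunk length, and the confidence-based unmasking schedule are all assumptions made for the example.

```python
# Minimal sketch (assumptions throughout): parallel decoding of a discrete
# action chunk with a masked-diffusion language model. The model interface,
# token ids, and unmasking schedule are hypothetical, not Dream-VLA's.
import torch

MASK_ID = 0          # hypothetical id of the [MASK] token
CHUNK_LEN = 8        # number of action tokens decoded jointly ("action chunking")
NUM_STEPS = 4        # denoising steps; the whole chunk is refined in parallel


@torch.no_grad()
def decode_action_chunk(model, prompt_ids: torch.Tensor) -> torch.Tensor:
    """Iteratively unmask an action chunk conditioned on the vision-language prompt.

    Assumes model(input_ids) returns logits of shape (batch, seq_len, vocab)
    computed with full bidirectional attention over the sequence.
    """
    batch = prompt_ids.size(0)
    # Start from a fully masked action chunk appended to the prompt.
    actions = torch.full((batch, CHUNK_LEN), MASK_ID, dtype=torch.long,
                         device=prompt_ids.device)

    for step in range(NUM_STEPS):
        input_ids = torch.cat([prompt_ids, actions], dim=1)
        logits = model(input_ids)[:, -CHUNK_LEN:, :]   # predictions for the chunk only
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                 # per-position confidence and argmax

        # Unmask the most confident positions first; keep the rest masked so
        # later denoising steps can predict them with more committed context.
        still_masked = actions.eq(MASK_ID)
        num_to_unmask = max(1, still_masked.sum(dim=1).max().item() // (NUM_STEPS - step))
        conf = conf.masked_fill(~still_masked, float("-inf"))
        top_idx = conf.topk(num_to_unmask, dim=1).indices
        actions.scatter_(1, top_idx, pred.gather(1, top_idx))

    return actions  # (batch, CHUNK_LEN) discrete action tokens, produced in parallel
```

Because every position in the chunk is predicted at each step, the number of model calls scales with the (small, fixed) number of denoising steps rather than with the chunk length, which is the intuition behind the faster convergence and parallel-generation claims in the abstract.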