DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
July 6, 2025
Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin
cs.AI
Abstract
Recent advances in vision-language-action (VLA) models have shown promise in
integrating image generation with action prediction to improve generalization
and reasoning in robot manipulation. However, existing methods are limited to
image-based forecasting, which suffers from redundant information
and lacks comprehensive and critical world knowledge, including dynamic,
spatial and semantic information. To address these limitations, we propose
DreamVLA, a novel VLA framework that integrates comprehensive world knowledge
forecasting to enable inverse dynamics modeling, thereby establishing a
perception-prediction-action loop for manipulation tasks. Specifically,
DreamVLA introduces dynamic-region-guided world knowledge prediction,
integrated with spatial and semantic cues, to provide compact yet
comprehensive representations for action planning. This design aligns with how
humans interact with the world by first forming abstract multimodal reasoning
chains before acting. To mitigate interference among the dynamic, spatial and
semantic information during training, we adopt a block-wise structured
attention mechanism that masks their mutual attention, preventing information
leakage and keeping each representation clean and disentangled. Moreover, to
model the conditional distribution over future actions, we employ a
diffusion-based transformer that disentangles action representations from
shared latent features. Extensive experiments on both real-world and simulation
environments demonstrate that DreamVLA achieves a 76.7% success rate on real
robot tasks and a 4.44 average length on the CALVIN ABC-D benchmark.
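
The block-wise structured attention described in the abstract can be pictured as a mask over a token sequence laid out as [context | dynamic | spatial | semantic], where each knowledge block attends to the shared context and to itself but not to the other blocks. The PyTorch sketch below illustrates such a mask; the token layout, the function name `build_blockwise_mask`, and the block sizes are illustrative assumptions, not the authors' implementation.

```python
import torch

def build_blockwise_mask(n_ctx: int, n_dyn: int, n_spa: int, n_sem: int) -> torch.Tensor:
    """Boolean [L, L] mask (True = attention allowed) for tokens ordered
    as [context | dynamic | spatial | semantic]."""
    sizes = [n_ctx, n_dyn, n_spa, n_sem]
    bounds = [0]
    for s in sizes:
        bounds.append(bounds[-1] + s)
    ctx, dyn, spa, sem = (slice(bounds[i], bounds[i + 1]) for i in range(4))

    mask = torch.zeros(bounds[-1], bounds[-1], dtype=torch.bool)
    mask[ctx, ctx] = True          # context tokens see each other
    for blk in (dyn, spa, sem):    # each knowledge block sees the context
        mask[blk, ctx] = True      # ...and itself, but not the other blocks,
        mask[blk, blk] = True      # keeping the representations disentangled
    return mask

# Can be passed as attn_mask to scaled_dot_product_attention, e.g.:
# out = F.scaled_dot_product_attention(q, k, v, attn_mask=build_blockwise_mask(64, 8, 8, 8))
```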
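Likewise, a diffusion-based transformer that decodes future actions conditioned on shared latent features can be sketched as a DDPM-style noise-prediction head that cross-attends to those features. The module below is a simplified, hypothetical illustration (all names, the linear noise schedule, and hyperparameters are assumptions), not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DiffusionActionHead(nn.Module):
    """Predicts the noise added to an action chunk, conditioned on latent features."""
    def __init__(self, act_dim=7, d_model=256, n_steps=100):
        super().__init__()
        self.n_steps = n_steps
        self.act_in = nn.Linear(act_dim, d_model)
        self.t_emb = nn.Embedding(n_steps, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.act_out = nn.Linear(d_model, act_dim)
        betas = torch.linspace(1e-4, 2e-2, n_steps)      # linear DDPM schedule
        self.register_buffer("alphas_cumprod", torch.cumprod(1.0 - betas, dim=0))

    def forward(self, noisy_actions, t, latent):
        # noisy_actions: [B, horizon, act_dim]; latent: [B, n_latent, d_model]
        h = self.act_in(noisy_actions) + self.t_emb(t)[:, None, :]
        h = self.decoder(tgt=h, memory=latent)           # cross-attend to latent features
        return self.act_out(h)                           # predicted noise

    def loss(self, actions, latent):
        B = actions.shape[0]
        t = torch.randint(0, self.n_steps, (B,), device=actions.device)
        noise = torch.randn_like(actions)
        a_bar = self.alphas_cumprod[t][:, None, None]
        noisy = a_bar.sqrt() * actions + (1.0 - a_bar).sqrt() * noise
        return nn.functional.mse_loss(self(noisy, t, latent), noise)
```

At inference, actions would be sampled by starting from Gaussian noise and iteratively denoising with this head while conditioning on the same latent features.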