DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
July 6, 2025
Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin
cs.AI
Abstract
Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are largely confined to image-based forecasting, which suffers from redundant information and lacks the comprehensive, critical world knowledge needed for manipulation, including dynamic, spatial, and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world-knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces dynamic-region-guided world-knowledge prediction, integrated with spatial and semantic cues, which provides a compact yet comprehensive representation for action planning. This design mirrors how humans interact with the world: forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial, and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from the shared latent features. Extensive experiments in both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real-robot tasks and an average task-completion length of 4.44 on the CALVIN ABC-D benchmark.
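
To make the "dynamic-region-guided" idea concrete, the sketch below derives a binary mask of moving pixels from simple frame differencing. This is a hedged illustration only: the function name, the thresholding scheme, and the use of raw intensity change are assumptions, and the paper's actual pipeline for locating dynamic regions (e.g., optical flow or point tracking) may differ.

```python
# Hypothetical sketch: locate "dynamic regions" between consecutive frames
# via frame differencing. Not DreamVLA's actual extraction pipeline.
import torch

def dynamic_region_mask(prev_frame: torch.Tensor,
                        next_frame: torch.Tensor,
                        threshold: float = 0.05) -> torch.Tensor:
    """Return an (H, W) boolean mask of pixels whose intensity changed.

    prev_frame, next_frame: (C, H, W) tensors with values in [0, 1].
    threshold: assumed sensitivity; higher keeps only larger motions.
    """
    diff = (next_frame - prev_frame).abs().mean(dim=0)  # per-pixel change
    return diff > threshold
```

Such a mask could then restrict prediction targets to regions where the scene actually changes, which is one way to arrive at the "compact yet comprehensive" supervision the abstract describes.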
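The block-wise structured attention can be pictured as an additive attention mask: every latent query group (dynamic, spatial, semantic) attends to the shared observation/language context, but cross-group attention is blocked. A minimal PyTorch sketch follows; the exact masking policy (e.g., context tokens attending only to other context tokens) is an assumption, not taken from the paper.

```python
# Hypothetical sketch of a block-wise structured attention mask.
# Token layout: [n_ctx shared context tokens | dynamic | spatial | semantic].
import torch

def blockwise_attention_mask(n_ctx: int, group_sizes: list[int]) -> torch.Tensor:
    """Additive mask (0 = attend, -inf = blocked) of shape (n, n)."""
    n = n_ctx + sum(group_sizes)
    mask = torch.full((n, n), float("-inf"))
    mask[:n_ctx, :n_ctx] = 0.0   # context attends to context (assumed policy)
    mask[n_ctx:, :n_ctx] = 0.0   # every knowledge group sees the shared context
    start = n_ctx
    for g in group_sizes:        # within-group attention only; no cross-group
        mask[start:start + g, start:start + g] = 0.0
        start += g
    return mask

# Usage: pass as attn_mask to scaled_dot_product_attention (PyTorch >= 2.0);
# the group sizes here are illustrative, not the paper's.
mask = blockwise_attention_mask(n_ctx=256, group_sizes=[9, 9, 9])
```

Because the -inf entries zero out cross-group attention weights after the softmax, no information (or gradient) flows between the dynamic, spatial, and semantic queries, which is what keeps the three representations disentangled.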
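Finally, the diffusion-based transformer over future actions can be sketched as a DDPM-style noise-prediction head conditioned on a shared latent feature. All sizes, the linear noise schedule, and the add-the-condition scheme below are illustrative assumptions rather than DreamVLA's actual architecture.

```python
# Hypothetical sketch: a diffusion transformer action head that learns to
# predict the noise added to an action chunk, conditioned on a latent feature.
import torch
import torch.nn as nn

class DiffusionActionHead(nn.Module):
    def __init__(self, action_dim=7, horizon=8, latent_dim=512, n_steps=100):
        super().__init__()
        self.n_steps = n_steps
        betas = torch.linspace(1e-4, 0.02, n_steps)          # assumed schedule
        self.register_buffer("alphas_cum", torch.cumprod(1.0 - betas, dim=0))
        self.time_emb = nn.Embedding(n_steps, latent_dim)
        self.in_proj = nn.Linear(action_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out_proj = nn.Linear(latent_dim, action_dim)

    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, horizon, action_dim), t: (B,), cond: (B, latent_dim)
        x = self.in_proj(noisy_actions) + self.time_emb(t)[:, None, :] + cond[:, None, :]
        return self.out_proj(self.backbone(x))               # predicted noise

    def loss(self, actions, cond):
        t = torch.randint(0, self.n_steps, (actions.shape[0],), device=actions.device)
        noise = torch.randn_like(actions)
        a_bar = self.alphas_cum[t][:, None, None]
        noisy = a_bar.sqrt() * actions + (1.0 - a_bar).sqrt() * noise
        return nn.functional.mse_loss(self(noisy, t, cond), noise)
```

At inference time, an action chunk would be sampled by iterating the standard DDPM reverse update from pure Gaussian noise, with `cond` drawn from the model's shared latent features, so the head models a full conditional distribution over future actions rather than a single point estimate.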