DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Abstract

Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to image-based forecasting, a challenging task that suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial, and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world: they first form abstract multimodal reasoning chains and then act. To mitigate interference among the dynamic, spatial, and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments in both real-world and simulated environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and an average length of 4.44 on the CALVIN ABC-D benchmark.
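
The block-wise structured attention described above can be illustrated with a small mask-construction sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the token layout, so the ordering used here (shared context tokens followed by dynamic, spatial, and semantic query groups) and the function names `build_block_mask` and `masked_softmax` are hypothetical. The sketch assumes each knowledge group may attend to the shared context and to itself, but not to the other two groups.

```python
import numpy as np


def build_block_mask(n_context, n_dynamic, n_spatial, n_semantic):
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    Assumed (hypothetical) layout: [context | dynamic | spatial | semantic].
    """
    sizes = [n_context, n_dynamic, n_spatial, n_semantic]
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    total = int(offsets[-1])
    mask = np.zeros((total, total), dtype=bool)

    # Assumption: every token may read the shared context tokens.
    mask[:, :n_context] = True

    # Each knowledge group attends within itself only, so the
    # dynamic/spatial/semantic blocks never see one another.
    for group in range(1, 4):
        start, end = offsets[group], offsets[group + 1]
        mask[start:end, start:end] = True
    return mask


def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    mask = build_block_mask(n_context=2, n_dynamic=1, n_spatial=1, n_semantic=1)
    rng = np.random.default_rng(0)
    scores = rng.normal(size=mask.shape)
    weights = masked_softmax(scores, mask)
    print(mask.astype(int))
    print(weights.round(3))
```

With this mask, the attention weights between any two different knowledge groups are exactly zero, which is one simple way to realize the "no mutual attention" property the abstract describes. Whether the context tokens should also be blocked from reading the query groups, as assumed here, is a design choice the abstract leaves open.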
