World Action Models are Zero-shot Policies

Abstract

State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built on a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. In real-robot experiments, this yields more than a 2x improvement in generalization to new tasks and environments over state-of-the-art VLAs. Crucially, model and system optimizations allow a 14B-parameter autoregressive video diffusion model to perform real-time closed-loop control at 7 Hz. Finally, we demonstrate two forms of cross-embodiment transfer. First, video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen-task performance with just 10-20 minutes of data. Second, and more surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.
