World Action Models are Zero-shot Policies
February 17, 2026
Authors: Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi Wang, Ryan Julian, Danfei Xu, Yilun Du, Yevgen Chebotar, Scott Reed, Jan Kautz, Yuke Zhu, Linxi "Jim" Fan, Joel Jang
cs.AI
Abstract
State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This results in over 2x improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance with just 10-20 minutes of data. More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.
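To make the abstract's core loop concrete, the sketch below shows what autoregressive closed-loop control with a World Action Model might look like in outline: at each step the model jointly predicts the next world state (a video latent) and a chunk of robot actions, and the predicted state is fed back in for the next step. This is a minimal toy, not DreamZero's implementation: the function names, latent sizes, the 4-action chunk, and the linear "denoiser" stand-in are all assumptions; a real WAM would run a 14B video diffusion backbone here.

```python
import numpy as np

def predict_world_and_action(video_latent, instruction_embedding, rng):
    """Hypothetical stand-in for a WAM forward pass: jointly predict the
    next video latent (future world state) and an action chunk.  A real
    model would run a video diffusion denoiser; this toy uses a linear
    update so the loop is runnable."""
    next_latent = 0.9 * video_latent + 0.1 * instruction_embedding
    action_chunk = rng.standard_normal((4, 7))  # e.g. 4 actions x 7-DoF arm
    return next_latent, action_chunk

def closed_loop_rollout(init_latent, instruction_embedding, steps):
    """Autoregressive closed-loop control: each step feeds the predicted
    world state back into the model while the action chunk would be sent
    to the robot.  The abstract reports this running at 7 Hz after model
    and system optimizations."""
    rng = np.random.default_rng(0)
    latent = init_latent
    chunks = []
    for _ in range(steps):
        latent, chunk = predict_world_and_action(latent, instruction_embedding, rng)
        chunks.append(chunk)
    return latent, np.concatenate(chunks, axis=0)

latent, actions = closed_loop_rollout(np.zeros(16), np.ones(16), steps=5)
print(actions.shape)  # 5 chunks of 4 actions each -> (20, 7)
```

The point of the structure, per the abstract, is that video prediction forces the model to learn physical dynamics rather than only an observation-to-action mapping, which is what the authors credit for the zero-shot generalization gains.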