World Action Models: The Next Frontier in Embodied AI
May 12, 2026
Authors: Siyin Wang, Junhao Shi, Zhaoyang Fu, Xinzhe He, Feihong Liu, Chenchen Yang, Yikang Zhou, Zhaoye Fei, Jingjing Gong, Jinlan Fu, Mike Zheng Shou, Xuanjing Huang, Xipeng Qiu, Yu-Gang Jiang
cs.AI
Abstract
Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models (predictive models of environment dynamics) into the action-generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, and lacks a unified conceptual framework. We formally define WAMs, disambiguate them from related concepts, and trace the foundations and early integration of the VLA and world-model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, further subdivided by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAM development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAM landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities in this rapidly evolving field.
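The contrast between a reactive VLA policy and the joint objective described above can be sketched compactly. The notation below (observation $o_t$, language instruction $\ell$, future states $s$, actions $a$, horizon $H$) is illustrative shorthand, not taken from the paper itself:

```latex
% Reactive VLA policy: actions conditioned only on the current
% observation and the language instruction.
\pi_{\mathrm{VLA}}(a_{t:t+H} \mid o_t, \ell)

% WAM objective: a joint distribution over future states AND actions.
p_{\mathrm{WAM}}(s_{t+1:t+H},\, a_{t:t+H} \mid o_t, \ell)

% Cascaded WAMs factorize this joint: first predict future states,
% then decode actions conditioned on the predicted states; Joint WAMs
% model the two jointly within one backbone.
p(s_{t+1:t+H} \mid o_t, \ell)\; p(a_{t:t+H} \mid s_{t+1:t+H}, o_t, \ell)
```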