World Action Models: The Next Frontier in Embodied AI
May 12, 2026
Authors: Siyin Wang, Junhao Shi, Zhaoyang Fu, Xinzhe He, Feihong Liu, Chenchen Yang, Yikang Zhou, Zhaoye Fei, Jingjing Gong, Jinlan Fu, Mike Zheng Shou, Xuanjing Huang, Xipeng Qiu, Yu-Gang Jiang
cs.AI
Abstract
Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models (predictive models of environment dynamics) into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs, disambiguate them from related concepts, and trace the foundations and early integration of the VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, subdivided further by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAM development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAM landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.
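The contrast between a reactive VLA policy and the WAM objective can be sketched notationally. This is an illustrative formulation only, not the survey's formal notation: $o_{\le t}$ denotes the observation history, $\ell$ a language instruction, $a_t$ an action, $s_{t+1}$ a predicted future state, and $H$ a prediction horizon.

```latex
% Reactive VLA policy: actions conditioned on observations alone
\pi_\theta(a_t \mid o_{\le t}, \ell)

% WAM objective: joint distribution over future states and actions
p_\theta\bigl(s_{t+1:t+H},\, a_{t:t+H-1} \mid o_{\le t}, \ell\bigr)

% Cascaded WAMs factor this joint as predict-then-act,
% while Joint WAMs model it with a single unified model:
p_\theta(s_{t+1:t+H} \mid o_{\le t}, \ell)\;
p_\phi(a_{t:t+H-1} \mid s_{t+1:t+H},\, o_{\le t}, \ell)
```

Under this sketch, the Cascaded/Joint split in the taxonomy corresponds to whether the factorization above is implemented by two separate models or absorbed into one.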