WALL-WM: Carving World Action Modeling at the Event Joints
June 1, 2026
Authors: Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang
cs.AI
Abstract
WALL-WM is a World Action Model (WAM) that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action (VLA) pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch: language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales. Forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks. The unified mode uses a vision-language model (VLM) with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
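To make the granularity mismatch concrete, the following is a minimal illustrative sketch, not the authors' implementation. It contrasts fixed-length windows with event-aligned, variable-length spans over a toy trajectory, and shows one common way to compute cluster-balanced sampling weights. All function names (`fixed_chunks`, `event_chunks`, `cluster_balanced_weights`), the per-step event labels, and the inverse-frequency weighting rule are assumptions made for illustration; the abstract does not specify how WALL-WM segments events or balances clusters.

```python
from collections import Counter


def fixed_chunks(num_steps, chunk_len):
    """Split [0, num_steps) into fixed-length (start, end) windows.

    Window boundaries ignore semantics, so one window may straddle two events.
    """
    return [(s, min(s + chunk_len, num_steps))
            for s in range(0, num_steps, chunk_len)]


def event_chunks(event_labels):
    """Group consecutive steps that share a label into (label, start, end) spans.

    `event_labels` is a hypothetical per-step annotation, e.g. derived from
    event-level captions. Spans are half-open and variable in length.
    """
    spans = []
    start = 0
    for t in range(1, len(event_labels) + 1):
        # Close the current span at the end of the sequence or on a label change.
        if t == len(event_labels) or event_labels[t] != event_labels[start]:
            spans.append((event_labels[start], start, t))
            start = t
    return spans


def cluster_balanced_weights(cluster_ids):
    """Per-sample weights giving every cluster equal total sampling mass.

    Assumed rule: weight = 1 / (num_clusters * cluster_size), so the weights
    sum to 1 and each cluster receives mass 1 / num_clusters.
    """
    counts = Counter(cluster_ids)
    num_clusters = len(counts)
    return [1.0 / (num_clusters * counts[c]) for c in cluster_ids]
```

With a 9-step toy trajectory labeled `reach, reach, reach, grasp, grasp, lift, lift, lift, lift`, `fixed_chunks(9, 4)` cuts the first window in the middle of the grasp event, whereas `event_chunks` returns one span per event. The weights from `cluster_balanced_weights` can be passed to a weighted sampler such as `random.choices(samples, weights=...)`.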