WALL-WM: Carving World Action Modeling at Event Joints

Abstract

WALL-WM is a World Action Model (WAM) that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action (VLA) pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch: language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales. Forcing all three into the same fixed-length prediction window reduces VLA training to short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks. The unified mode uses a vision-language model (VLM) with Staircase Decoding to condition conventional fixed-length chunk inference, while preserving a gradient-continuous VLA path. Together with a large-scale pretraining infrastructure based on the Muon optimizer, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
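
The abstract names cluster-balanced sampling but does not describe how it is carried out. As a rough illustration only, the sketch below shows one common reading of the term: first pick a cluster uniformly at random, then pick an event uniformly within that cluster, so that rare behaviors are not swamped by frequent ones. The function name cluster_balanced_sample, the cluster labels, and the toy data are all hypothetical and are not taken from WALL-WM.

```python
import random
from collections import Counter, defaultdict


def cluster_balanced_sample(cluster_ids, num_samples, seed=0):
    """Draw event indices so that every cluster is equally likely to be chosen.

    cluster_ids[i] is the (hypothetical) cluster label of event i.
    This is an illustrative assumption, not the sampler used by WALL-WM.
    """
    rng = random.Random(seed)

    # Group event indices by their cluster label.
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    clusters = sorted(members)
    samples = []
    for _ in range(num_samples):
        cluster = rng.choice(clusters)                # uniform over clusters
        samples.append(rng.choice(members[cluster]))  # uniform within the cluster
    return samples


# Toy, made-up data: "pick" has 90 events, "pour" has 9, "fold" has 1.
cluster_ids = ["pick"] * 90 + ["pour"] * 9 + ["fold"] * 1
drawn = cluster_balanced_sample(cluster_ids, num_samples=3000)
counts = Counter(cluster_ids[i] for i in drawn)
print(counts)
```

With this scheme each of the three toy clusters is drawn roughly one third of the time, even though "pick" makes up 90% of the raw events. The cost is that the single "fold" event is repeated often, so a practical pipeline would likely temper the balance or cap per-item repetition.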