WALL-WM: Carving World Action Modeling at the Event Joints

June 1, 2026
Authors: Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang
cs.AI

Abstract

WALL-WM is a World Action Model (WAM) that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action (VLA) pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks. The unified mode uses a vision-language model (VLM) with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
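
The granularity mismatch described in the abstract can be illustrated with a toy example. The sketch below is not the WALL-WM implementation; the function names, the `(caption, start, end)` event format, and the sample trajectory are all hypothetical, chosen only to contrast fixed-length windows with segments aligned to captioned events.

```python
from typing import List, Sequence, Tuple

# Hypothetical event annotation: (caption, start_step, end_step), end exclusive.
Event = Tuple[str, int, int]


def fixed_length_chunks(actions: Sequence[int], chunk_len: int) -> List[List[int]]:
    """Split a trajectory into windows of chunk_len steps (the last may be shorter)."""
    return [list(actions[i:i + chunk_len]) for i in range(0, len(actions), chunk_len)]


def event_chunks(actions: Sequence[int], events: Sequence[Event]) -> List[Tuple[str, List[int]]]:
    """Split a trajectory at annotated event boundaries, keeping each caption."""
    return [(caption, list(actions[start:end])) for caption, start, end in events]


# Toy trajectory: 10 control steps, represented here by their step indices.
actions = list(range(10))

# Made-up event-level captions covering the whole trajectory.
events = [
    ("reach for the cup", 0, 3),
    ("grasp the cup", 3, 5),
    ("place the cup on the shelf", 5, 10),
]

fixed = fixed_length_chunks(actions, chunk_len=4)
by_event = event_chunks(actions, events)

print([len(chunk) for chunk in fixed])                     # fixed windows ignore semantics
print([(caption, len(chunk)) for caption, chunk in by_event])  # variable-length, one per event
```

In this toy case the first fixed window covers steps 0 to 3 and therefore mixes the end of the reach with the start of the grasp, while each event segment pairs exactly one caption with a variable number of control steps.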