WALL-WM: Carving World Action Modeling at Event Joints

Abstract

WALL-WM is a World Action Model (WAM) that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action (VLA) pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch: language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales. Forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode takes next-event descriptions as input and produces variable-length execution chunks. The unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with a large-scale pretraining infrastructure based on the Muon optimizer, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in a large-scale real-world generalization evaluation.
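
The contrast between fixed-length chunks and event-aligned segments, and the idea of cluster-balanced sampling, can be illustrated with a small sketch. This is not WALL-WM's implementation: the function names (`fixed_chunks`, `event_chunks`, `cluster_balanced_sample`), the boundary annotation format, and the choice to sample clusters uniformly are all assumptions made here for illustration only.

```python
import random
from collections import defaultdict


def fixed_chunks(actions, chunk_len):
    """Chunk-centric view: cut the action sequence every chunk_len steps."""
    return [actions[i:i + chunk_len] for i in range(0, len(actions), chunk_len)]


def event_chunks(actions, boundaries):
    """Event-grounded view: cut at event boundaries.

    `boundaries` is a hypothetical annotation: the sorted start index of
    every event after the first one.
    """
    cuts = [0, *boundaries, len(actions)]
    return [actions[a:b] for a, b in zip(cuts, cuts[1:])]


def cluster_balanced_sample(events, cluster_of, k, rng):
    """Draw k events by picking a cluster uniformly, then an event within it.

    Assumption: "cluster-balanced" is read here as uniform over clusters,
    so rare behaviors are drawn as often as common ones.
    """
    by_cluster = defaultdict(list)
    for ev in events:
        by_cluster[cluster_of(ev)].append(ev)
    clusters = list(by_cluster)
    return [rng.choice(by_cluster[rng.choice(clusters)]) for _ in range(k)]


if __name__ == "__main__":
    actions = list(range(10))
    print(fixed_chunks(actions, 4))       # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
    print(event_chunks(actions, [3, 7]))  # [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]

    # 98 "grasp" events and 2 "pour" events: a heavily imbalanced toy set.
    events = [("grasp", i) for i in range(98)] + [("pour", 0), ("pour", 1)]
    batch = cluster_balanced_sample(events, lambda ev: ev[0], 1000, random.Random(0))
    print(sum(1 for ev in batch if ev[0] == "pour"))  # roughly half the batch
```

In this toy setup the fixed-length cut splits the second event across two chunks, while the event-aligned cut keeps each event whole at the cost of variable chunk length. The sampler draws the rare "pour" cluster about as often as the common "grasp" cluster, even though it holds only 2% of the events.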