WALL-WM: Carving World Action Modeling at the Event Joints

Abstract

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks. The unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with a large-scale pretraining infrastructure based on the Muon optimizer, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, and achieves state-of-the-art performance in a large-scale real-world generalization evaluation.
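
To make the contrast between fixed-length chunks and event-grounded chunks concrete, the following is a minimal illustrative sketch, not the authors' implementation. The function names, the `(caption, start, end)` span format, and the two-stage form of cluster-balanced sampling (pick a cluster uniformly, then pick an event within it) are all assumptions made for illustration; the abstract does not specify how events are segmented or how clusters are built.

```python
import random
from collections import defaultdict


def fixed_chunks(actions, chunk_len):
    """Chunk-centric view: split a trajectory into fixed-length windows."""
    return [actions[i:i + chunk_len] for i in range(0, len(actions), chunk_len)]


def event_chunks(actions, event_spans):
    """Event-grounded view: one variable-length chunk per captioned span.

    event_spans is a hypothetical list of (caption, start, end) tuples,
    with end exclusive.
    """
    return [(caption, actions[start:end]) for caption, start, end in event_spans]


def cluster_balanced_sample(events, cluster_of, n, rng):
    """Draw n events by choosing a cluster uniformly, then an event inside it.

    This is one simple way to balance clusters (an assumption, not the
    paper's stated procedure): rare clusters are drawn as often as common ones.
    """
    by_cluster = defaultdict(list)
    for ev in events:
        by_cluster[cluster_of(ev)].append(ev)
    clusters = list(by_cluster)
    return [rng.choice(by_cluster[rng.choice(clusters)]) for _ in range(n)]


# Toy trajectory of 10 control steps and three hypothetical captioned events.
actions = list(range(10))
spans = [
    ("reach for cup", 0, 3),
    ("grasp cup", 3, 5),
    ("move cup to shelf", 5, 10),
]
fixed = fixed_chunks(actions, 4)
by_event = event_chunks(actions, spans)

# Imbalanced toy dataset: 90 "pick" events and 10 "pour" events.
dataset = [("pick", i) for i in range(90)] + [("pour", i) for i in range(10)]
sample = cluster_balanced_sample(dataset, lambda ev: ev[0], 2000, random.Random(0))
pour_fraction = sum(1 for ev in sample if ev[0] == "pour") / len(sample)
```

In this toy setting the fixed-length windows cut across event boundaries (the second window mixes the end of the reach, the grasp, and the start of the move), whereas each event chunk covers exactly one captioned span. With balanced sampling, the rare "pour" cluster is drawn roughly half the time even though it makes up only 10% of the data.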