WoW: Towards a World omniscient World model Through Embodied Interaction

Abstract

Humans develop an understanding of intuitive physics through active interaction with the world. This stands in stark contrast to current video models, such as Sora, which rely on passive observation and therefore struggle to grasp physical causality. This observation leads to our central hypothesis: a world model's authentic physical intuition must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model's understanding of physics is a probabilistic distribution of plausible outcomes, leading to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, in which vision-language model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co-trained Inverse Dynamics Model translates these refined plans into executable robotic actions, thus closing the imagination-to-action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, on which WoW achieves state-of-the-art performance in both human and autonomous evaluation, demonstrating strong ability in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open-sourced.
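
As an illustration only, the generate, critique, and refine loop described in the abstract might be sketched as follows. The callables `generate`, `critique`, `rewrite`, and `inverse_dynamics`, the `Critique` type, the score threshold, and the round limit are all hypothetical placeholders; the abstract does not specify SOPHIA's actual interface or stopping rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Video = List[str]            # placeholder for a generated clip (assumption)
Action = Tuple[float, ...]   # placeholder for one robot command (assumption)


@dataclass
class Critique:
    score: float    # hypothetical physical-plausibility score in [0, 1]
    feedback: str   # hypothetical natural-language critique


def refine_and_act(
    instruction: str,
    generate: Callable[[str], Video],               # stands in for the DiT world model
    critique: Callable[[Video, str], Critique],     # stands in for the VLM evaluator
    rewrite: Callable[[str, str], str],             # stands in for the instruction refiner
    inverse_dynamics: Callable[[Video], List[Action]],  # stands in for the IDM
    threshold: float = 0.8,   # assumed acceptance threshold
    max_rounds: int = 3,      # assumed iteration budget
) -> Tuple[str, List[Action]]:
    """Iteratively evolve the instruction, then turn the best rollout into actions."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    current = instruction
    best_video: Video = []
    best_instruction = instruction
    best_score = float("-inf")

    for _ in range(max_rounds):
        video = generate(current)
        result = critique(video, current)
        # Keep the most plausible rollout seen so far.
        if result.score > best_score:
            best_video, best_instruction, best_score = video, current, result.score
        if result.score >= threshold:
            break
        # Evolve the language instruction using the critic's feedback.
        current = rewrite(current, result.feedback)

    # Translate the selected imagined rollout into executable actions.
    return best_instruction, inverse_dynamics(best_video)
```

Passing the models in as plain callables keeps the sketch independent of any particular framework; real components would replace the stubs.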
