WorldOlympiad: Can Your World Model Survive a Triathlon?

Abstract

We introduce WorldOlympiad, a benchmark for diagnosing video-based world models along three axes: physical faithfulness, geometric consistency, and interaction fidelity. Existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, and so provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions, each with its own track. The physical track uses object segmentation and an MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad also covers three major downstream scenarios: gaming, robotics, and general real-world video. These capture diverse challenges, from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, the tracks and scenarios form a scalable, interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.
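
As a rough illustration of the kind of check the interaction track describes (smooth transitions across consecutive video chunks), the sketch below compares the frame-to-frame change at chunk boundaries with the change inside chunks. This is not the benchmark's metric: the function names `frame_diff` and `boundary_jump_ratio`, the flat-list frame representation, and the ratio itself are assumptions made purely for illustration.

```python
from statistics import mean


def frame_diff(a, b):
    """Mean absolute difference between two frames given as flat lists of pixel values."""
    return mean(abs(x - y) for x, y in zip(a, b))


def boundary_jump_ratio(chunks):
    """Ratio of mean change at chunk boundaries to mean change inside chunks.

    chunks: list of chunks, each a list of frames (flat lists of floats).
    A value near 1 suggests boundaries look like ordinary frame steps;
    a much larger value suggests a visible jump between chunks.
    (Hypothetical score, not taken from the benchmark.)
    """
    # Changes between consecutive frames within the same chunk.
    inside = [
        frame_diff(c[i], c[i + 1])
        for c in chunks
        for i in range(len(c) - 1)
    ]
    # Changes between the last frame of one chunk and the first of the next.
    boundary = [
        frame_diff(chunks[k][-1], chunks[k + 1][0])
        for k in range(len(chunks) - 1)
    ]
    if not inside or not boundary:
        raise ValueError("need at least two chunks and one chunk with two or more frames")
    base = mean(inside)
    jump = mean(boundary)
    if base == 0:
        # Static content inside chunks: any boundary change is an unbounded jump.
        return float("inf") if jump > 0 else 1.0
    return jump / base


# Two chunks of two tiny "frames" each; the second rollout jumps at the boundary.
smooth = [[[0.0, 0.0], [1.0, 1.0]], [[2.0, 2.0], [3.0, 3.0]]]
jumpy = [[[0.0, 0.0], [1.0, 1.0]], [[5.0, 5.0], [6.0, 6.0]]]
print(boundary_jump_ratio(smooth), boundary_jump_ratio(jumpy))
```

A pixel-difference ratio like this only flags abrupt visual discontinuities; it says nothing about whether the rollout actually follows the action prompt, which the abstract lists as a separate criterion.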