WorldOlympiad: Can Your World Model Survive the Triathlon?

Abstract

We introduce WorldOlympiad, a benchmark for diagnosing video-based world models along three axes: physical faithfulness, geometric consistency, and interaction fidelity. Existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, and so provide limited insight into whether generated videos obey physical laws, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary tracks. The physics track uses object segmentation and an MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios: gaming, robotics, and general real-world video. These capture diverse challenges ranging from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, the tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.