WorldOlympiad: Can Your World Model Pass the Triathlon?

Abstract

We introduce WorldOlympiad, a benchmark for diagnosing video-based world models along three dimensions: physical faithfulness, geometric consistency, and interaction fidelity. Existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, and so provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary tracks. The physical track uses object segmentation and a multimodal large language model (MLLM) as judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios: gaming, robotics, and general real-world videos. These capture diverse challenges ranging from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, the tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.