WorldOlympiad: Can Your World Model Pass the Triathlon?

Abstract

We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. Existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, and so provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and a multimodal large language model (MLLM) as judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios: gaming, robotics, and general real-world videos. These capture diverse challenges ranging from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, the tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.