WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models

January 29, 2026
Authors: Rishi Upadhyay, Howard Zhang, Jim Solomon, Ayush Agrawal, Pranay Boreddy, Shruti Satya Narayana, Yunhao Ba, Alex Wong, Celso M de Melo, Achuta Kadambi
cs.AI

Abstract

Recent advances in generative foundation models, often termed "world models," have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement: a single test simultaneously evaluates multiple physical laws and concepts, which fundamentally limits its diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding, with concepts such as object permanence or scale/perspective, and 2) an evaluation of low-level physical constants and material properties, such as friction coefficients or fluid viscosity. When state-of-the-art video-based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real-world interactions. Through its concept-specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world-model-driven learning.