WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models
January 29, 2026
Authors: Rishi Upadhyay, Howard Zhang, Jim Solomon, Ayush Agrawal, Pranay Boreddy, Shruti Satya Narayana, Yunhao Ba, Alex Wong, Celso M de Melo, Achuta Kadambi
cs.AI
Abstract
Recent advances in generative foundational models, often termed "world models," have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts such as object permanence or scale/perspective, and 2) an evaluation of low-level physical constants and material properties such as friction coefficients or fluid viscosity. When SOTA video-based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real-world interactions. Through its concept-specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world-model-driven learning.