Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation Framework for World Models

Abstract

Video-based world models have developed along two dominant paradigms: video generation and 3D reconstruction. Existing evaluation benchmarks, however, either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark designed specifically to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capability by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response and provides actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.