The Trinity of Consistency as a Defining Principle for General World Models

Abstract

Constructing World Models that can learn, simulate, and reason about objective physical laws is a foundational challenge in the pursuit of Artificial General Intelligence. Recent advances in video generation models such as Sora have demonstrated the potential of data-driven scaling laws to approximate physical dynamics, while the emerging Unified Multimodal Model (UMM) offers a promising architectural paradigm for integrating perception, language, and reasoning. Despite these advances, the field still lacks a principled theoretical framework that defines the properties essential to a General World Model. In this paper, we propose that a World Model must be grounded in the Trinity of Consistency: Modal Consistency as the semantic interface, Spatial Consistency as the geometric basis, and Temporal Consistency as the causal engine. Through this tripartite lens, we systematically review the evolution of multimodal learning, revealing a trajectory from loosely coupled specialized modules toward unified architectures that enable the synergistic emergence of internal world simulators. To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios. CoW-Bench evaluates both video generation models and UMMs under a unified evaluation protocol. Our work establishes a principled pathway toward general world models, clarifying both the limitations of current systems and the architectural requirements for future progress.