The Trinity of Consistency as a Defining Principle for General World Models
February 26, 2026
Authors: Jingxuan Wei, Siyuan Li, Yuhang Xu, Zheng Sun, Junjie Jiang, Hexuan Jin, Caijun Jia, Honghao He, Xinglong Xu, Xi Bai, Chang Yu, Yumou Liu, Junnan Zhu, Xuanhe Zhou, Jintao Chen, Xiaobin Hu, Shancheng Pang, Bihui Yu, Ran He, Zhen Lei, Stan Z. Li, Conghui He, Shuicheng Yan, Cheng Tan
cs.AI
Abstract
Building World Models that can learn, simulate, and reason about objective physical laws is a foundational challenge in the pursuit of Artificial General Intelligence. Recent video generation models such as Sora have demonstrated the potential of data-driven scaling laws to approximate physical dynamics, while the emerging Unified Multimodal Model (UMM) offers a promising architectural paradigm for integrating perception, language, and reasoning. Despite these advances, the field still lacks a principled theoretical framework that defines the essential properties of a General World Model. In this paper, we propose that a World Model must be grounded in the Trinity of Consistency: Modal Consistency as the semantic interface, Spatial Consistency as the geometric basis, and Temporal Consistency as the causal engine. Through this tripartite lens, we systematically review the evolution of multimodal learning, revealing a trajectory from loosely coupled specialized modules toward unified architectures that enable the synergistic emergence of internal world simulators. To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios. CoW-Bench evaluates both video generation models and UMMs under a unified evaluation protocol. Our work establishes a principled pathway toward general world models, clarifying both the limitations of current systems and the architectural requirements for future progress.