A Mechanistic View on Video Generation as World Models: State and Dynamics
January 22, 2026
Authors: Luozhou Wang, Zhifei Chen, Yihua Du, Dongyu Yan, Wenhang Ge, Guibao Shen, Xinli Xu, Leyi Wu, Man Chen, Tianshuo Xu, Peiran Ren, Xin Tao, Pengfei Wan, Ying-Cong Chen
cs.AI
Abstract
Large-scale video generation models have demonstrated emergent physical coherence, positioning them as potential world models. However, a gap remains between contemporary "stateless" video architectures and classic state-centric world model theories. This work bridges this gap by proposing a novel taxonomy centered on two pillars: State Construction and Dynamics Modeling. We categorize state construction into implicit paradigms (context management) and explicit paradigms (latent compression), while dynamics modeling is analyzed through knowledge integration and architectural reformulation. Furthermore, we advocate for a transition in evaluation from visual fidelity to functional benchmarks, testing physical persistence and causal reasoning. We conclude by identifying two critical frontiers: enhancing persistence via data-driven memory and compressed fidelity, and advancing causality through latent factor decoupling and reasoning-prior integration. By addressing these challenges, the field can evolve from generating visually plausible videos to building robust, general-purpose world simulators.