Video Generation Models as World Simulators: Efficient Paradigms, Architectures, and Algorithms

Abstract

The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long-horizon causal relationships, positioning them as potential world simulators. However, a critical gap remains between this theoretical capacity for world simulation and the heavy computational cost of spatiotemporal modeling. To address this gap, we systematically review video generation frameworks and techniques that treat efficiency as a core requirement for practical world modeling. We introduce a novel taxonomy along three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. We further show that bridging the efficiency gap directly empowers interactive applications such as autonomous driving, embodied AI, and game simulation. Finally, we identify emerging research frontiers in efficient video-based world modeling and argue that efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, and robust world simulators.