Simulating the Visual World with Artificial Intelligence: A Roadmap

November 11, 2025
Authors: Jingtong Yue, Ziqi Huang, Zhaoxi Chen, Xintao Wang, Pengfei Wan, Ziwei Liu
cs.AI

Abstract

The landscape of video generation is shifting from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models: models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, so that each generated video acts as a "window" into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine its applications in domains such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained at this link.