Simulating the Visual World with Artificial Intelligence: A Roadmap

November 11, 2025
Authors: Jingtong Yue, Ziqi Huang, Zhaoxi Chen, Xintao Wang, Pengfei Wan, Ziwei Liu
cs.AI

Abstract

The landscape of video generation is shifting from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models: systems that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a "window" into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine their application domains such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained at this link.