Exploring the Evolution of Physics Cognition in Video Generation: A Survey
March 27, 2025
Authors: Minghui Lin, Xiang Wang, Yishan Wang, Shu Wang, Fengqi Dai, Pengxiang Ding, Cunxiang Wang, Zhengrong Zuo, Nong Sang, Siteng Huang, Donglin Wang
cs.AI
Abstract
Video generation has advanced significantly in recent years, especially with
the rapid development of diffusion models. Despite this progress, the models'
deficiencies in physical cognition have drawn growing attention: generated
content often violates fundamental laws of physics, falling into the dilemma
of "visual realism but physical absurdity". Researchers have increasingly
recognized the importance of physical fidelity in video generation and have
attempted to integrate heuristic physical cognition, such as motion
representations and physical knowledge, into generative systems to simulate
real-world dynamic scenarios. Because the field lacks a systematic overview,
this survey aims to fill the gap with a comprehensive summary of architecture
designs and their applications. Specifically, we discuss and organize the
evolution of physical cognition in video generation from a cognitive science
perspective and propose a three-tier taxonomy: 1) basic schema perception for
generation, 2) passive cognition of physical knowledge for generation, and
3) active cognition for world simulation, encompassing state-of-the-art
methods, classical paradigms, and benchmarks. We then highlight the key
challenges inherent in this domain and outline potential pathways for future
research, contributing to the discussion in both academia and industry.
Through structured review and interdisciplinary analysis, this survey aims to
provide directional guidance for developing interpretable, controllable, and
physically consistent video generation paradigms, thereby propelling
generative models from the stage of "visual mimicry" towards a new phase of
"human-like physical comprehension".