Exploring the Evolution of Physics Cognition in Video Generation: A Survey

Abstract

Video generation has made significant progress in recent years, especially with the rapid development of diffusion models. Despite this, the models' deficiencies in physical cognition have drawn growing attention: generated content often violates fundamental laws of physics, falling into the dilemma of "visual realism but physical absurdity". Researchers have increasingly recognized the importance of physical fidelity in video generation and have attempted to integrate heuristic physical cognition, such as motion representations and physical knowledge, into generative systems to simulate real-world dynamic scenarios. Because the field lacks a systematic overview, this survey aims to fill the gap with a comprehensive summary of architecture designs and their applications. Specifically, we discuss and organize the evolution of physical cognition in video generation from a cognitive science perspective and propose a three-tier taxonomy: 1) basic schema perception for generation, 2) passive cognition of physical knowledge for generation, and 3) active cognition for world simulation, covering state-of-the-art methods, classical paradigms, and benchmarks. We then highlight the key challenges inherent in this domain and outline potential pathways for future research, contributing to the frontier of discussion in both academia and industry. Through structured review and interdisciplinary analysis, this survey aims to provide directional guidance for developing interpretable, controllable, and physically consistent video generation paradigms, thereby moving generative models from the stage of "visual mimicry" towards a new phase of "human-like physical comprehension".
