Exploring the Evolution of Physics Cognition in Video Generation: A Survey
March 27, 2025
Authors: Minghui Lin, Xiang Wang, Yishan Wang, Shu Wang, Fengqi Dai, Pengxiang Ding, Cunxiang Wang, Zhengrong Zuo, Nong Sang, Siteng Huang, Donglin Wang
cs.AI
Abstract
Video generation has progressed significantly in recent years, driven largely
by the rapid development of diffusion models. Yet the deficiencies of these
models in physical cognition have drawn growing attention: generated content
often violates fundamental laws of physics, falling into the dilemma of
"visual realism but physical absurdity". Researchers have increasingly
recognized the importance of physical fidelity in video generation and have
attempted to integrate heuristic physical cognition, such as motion
representations and physical knowledge, into generative systems to simulate
real-world dynamic scenarios. Given the lack of a systematic overview of this
field, this survey provides a comprehensive summary of architecture designs
and their applications to fill this gap. Specifically, we discuss and organize
the evolution of physical cognition in video generation from a cognitive
science perspective and propose a three-tier taxonomy: 1) basic schema
perception for generation, 2) passive cognition of physical knowledge for
generation, and 3) active cognition for world simulation, encompassing
state-of-the-art methods, classical paradigms, and benchmarks. We then
highlight the key challenges inherent in this domain and outline potential
pathways for future research, contributing to the discussion in both academia
and industry. Through structured review and interdisciplinary analysis, this
survey aims to provide directional guidance for developing interpretable,
controllable, and physically consistent video generation paradigms, thereby
moving generative models from the stage of "visual mimicry" towards a new
phase of "human-like physical comprehension".