Exploring the Evolution of Physics Cognition in Video Generation: A Survey

Abstract

Video generation has progressed significantly in recent years, driven in particular by the rapid development of diffusion models. Despite this progress, the deficiencies of these models in physical cognition have drawn growing attention: generated content often violates the fundamental laws of physics, falling into the dilemma of "visual realism but physical absurdity". Researchers increasingly recognize the importance of physical fidelity in video generation and have attempted to integrate heuristic physical cognition, such as motion representations and physical knowledge, into generative systems to simulate real-world dynamic scenarios. Given the lack of a systematic overview of this field, this survey provides a comprehensive summary of architecture designs and their applications to fill the gap. Specifically, we discuss and organize the evolution of physical cognition in video generation from a cognitive science perspective and propose a three-tier taxonomy: 1) basic schema perception for generation, 2) passive cognition of physical knowledge for generation, and 3) active cognition for world simulation, covering state-of-the-art methods, classical paradigms, and benchmarks. We then highlight the key challenges inherent in this domain and outline potential directions for future research, contributing to the discussion in both academia and industry. Through structured review and interdisciplinary analysis, this survey aims to guide the development of interpretable, controllable, and physically consistent video generation paradigms, thereby moving generative models from the stage of "visual mimicry" towards a new phase of "human-like physical comprehension".
