Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Abstract

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of this physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in adherence to physical dynamics but also delivers competitive perceptual fidelity.
