Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Abstract

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, which allows physical dynamics to be predicted alongside video content without an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of this physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in adherence to physical dynamics but also delivers competitive perceptual fidelity.
