Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Abstract

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, which allows physical dynamics to be predicted alongside video content without an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of this physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in adherence to physical dynamics but also delivers competitive perceptual fidelity.