Inference-time Physics Alignment of Video Generative Models with Latent World Models
January 15, 2026
Authors: Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano
cs.AI
Abstract
State-of-the-art video generative models produce visually compelling content yet often violate basic physical principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding acquired during pre-training, we find that the shortfall in physical plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward, treating the improvement of physical plausibility in video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search over and steer multiple candidate denoising trajectories, enabling test-time compute to be scaled for better generation performance. Empirically, our approach substantially improves physical plausibility across image-conditioned, multi-frame-conditioned, and text-conditioned generation settings, as validated by a human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve the physical plausibility of video generation, beyond this specific instantiation or parameterization.
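The core idea above, scoring candidate denoising trajectories with a world-model reward and keeping the best one, can be sketched as a best-of-N selection loop. This is a minimal illustrative sketch, not the paper's implementation: `sample_video` and `wm_reward` are hypothetical stand-ins (the actual reward would come from a latent world model such as VJEPA-2, and candidates from a diffusion sampler).

```python
import numpy as np

def wm_reward(video):
    # Hypothetical stand-in for a latent world-model physics score:
    # here we simply reward low frame-to-frame jitter. A real reward
    # would measure plausibility in the world model's latent space.
    diffs = np.diff(video, axis=0)
    return -float(np.mean(diffs ** 2))

def best_of_n(sample_fn, n, reward_fn):
    """Best-of-N search: draw n candidate trajectories and keep
    the one the world-model reward ranks highest."""
    candidates = [sample_fn() for _ in range(n)]
    scores = [reward_fn(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(0)

def sample_video():
    # Stand-in generator: a short "video" of 8 frames of 4x4 pixels,
    # with a randomly chosen amount of temporal noise per candidate.
    base = rng.normal(size=(1, 4, 4))
    noise = rng.normal(scale=rng.uniform(0.01, 1.0), size=(8, 4, 4))
    return base + noise

best = best_of_n(sample_video, n=16, reward_fn=wm_reward)
```

The abstract also mentions steering trajectories during denoising; a fuller variant would apply the reward at intermediate denoising steps rather than only to finished samples.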