PhyCo: Learning Controllable Physical Priors for Generative Motion
April 30, 2026
Authors: Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan, Manmohan Chandraker
cs.AI
Abstract
Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes, without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.
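To make component (ii) concrete, the sketch below illustrates the general ControlNet pattern the abstract describes: per-pixel physical property maps (friction, restitution, deformation, force) are encoded by a small conditioning branch whose output is added as a residual to a frozen backbone layer. This is a minimal toy in NumPy; all names, shapes, and the single-layer structure are illustrative assumptions, not PhyCo's actual architecture.

```python
import numpy as np

def encode_property_maps(maps, weight):
    """1x1-conv-style projection of C per-pixel property channels to F features."""
    h, w, c = maps.shape
    return maps.reshape(h * w, c) @ weight  # (H*W, F)

def backbone_features(latents, weight):
    """Stand-in for one frozen diffusion-backbone layer."""
    h, w, c = latents.shape
    return latents.reshape(h * w, c) @ weight  # (H*W, F)

def conditioned_features(latents, prop_maps, w_backbone, w_control):
    """ControlNet-style residual injection: frozen f(x) + trainable g(props)."""
    return backbone_features(latents, w_backbone) + encode_property_maps(prop_maps, w_control)

rng = np.random.default_rng(0)
H, W = 8, 8
latents = rng.normal(size=(H, W, 4))   # noisy video latents (toy)
props = rng.uniform(size=(H, W, 4))    # friction, restitution, deformation, force maps
w_b = rng.normal(size=(4, 16))         # frozen backbone weights
w_c = np.zeros((4, 16))                # zero-initialized control branch

out = conditioned_features(latents, props, w_b, w_c)
# With zero-initialized control weights the backbone output is unchanged,
# mirroring ControlNet's zero-convolution initialization: training can only
# gradually introduce the physics conditioning signal.
assert np.allclose(out, backbone_features(latents, w_b))
```

Because the property maps are pixel-aligned with the latents, varying a single channel (e.g. raising friction in one region) changes the conditioning residual only at those pixels, which is what makes the control continuous and spatially localized.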