PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning
October 15, 2025
Authors: Sihui Ji, Xi Chen, Xin Tao, Pengfei Wan, Hengshuang Zhao
cs.AI
Abstract
Current video generation models can produce visually realistic videos, but
they often fail to adhere to physical laws, which limits their ability to
generate physically plausible videos and to serve as "world models". To address
this issue, we propose PhysMaster, which captures physical knowledge as a
representation for guiding video generation models to enhance their
physics-awareness. Specifically, PhysMaster is based on the image-to-video task
where the model is expected to predict physically plausible dynamics from the
input image. Since the input image provides physical priors like relative
positions and potential interactions of objects in the scenario, we devise
PhysEncoder to encode physical information from it as an extra condition to
inject physical knowledge into the video generation process. The lack of proper
supervision on the model's physical performance beyond mere appearance
motivates applying reinforcement learning with human feedback to PhysEncoder's
physical representation learning, which leverages feedback from generation
models to optimize physical representations with Direct Preference Optimization
(DPO) in an end-to-end manner. PhysMaster provides a feasible solution for
improving the physics-awareness of PhysEncoder and thus of video generation,
demonstrating its ability on a simple proxy task and its generalizability to
wide-ranging physical scenarios. This suggests that PhysMaster, which unifies
solutions for various physical processes via representation learning in the
reinforcement learning paradigm, can act as a generic, plug-in solution for
physics-aware video generation and broader applications.
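
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not the authors' implementation: the function name, argument names, and the default `beta` are hypothetical, and it assumes scalar log-likelihoods are available for the preferred and rejected samples under both the trained model and a frozen reference model. How such scores are obtained for a video generation model is not specified in the abstract.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    All inputs are scalar log-likelihoods (hypothetical placeholders here):
    logp_*     : scores under the model being optimized
    ref_logp_* : scores under a frozen reference model
    beta       : strength of the implicit constraint toward the reference
    """
    # Implicit reward margin between the preferred and rejected samples.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss decreases as the optimized model assigns relatively more likelihood to the preferred sample than the reference model does, which is what makes a pairwise preference signal usable as a training objective.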