Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals
May 26, 2025
作者: Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, Chen Sun
cs.AI
Abstract
Recent advances in video generation models have sparked interest in world
models capable of simulating realistic environments. While navigation has been
well-explored, physically meaningful interactions that mimic real-world forces
remain largely understudied. In this work, we investigate using physical forces
as a control signal for video generation and propose force prompts which enable
users to interact with images through both localized point forces, such as
poking a plant, and global wind force fields, such as wind blowing on fabric.
We demonstrate that these force prompts can enable videos to respond
realistically to physical control signals by leveraging the visual and motion
prior in the original pretrained model, without using any 3D asset or physics
simulator at inference. The primary challenge of force prompting is the
difficulty in obtaining high quality paired force-video training data, both in
the real world due to the difficulty of obtaining force signals, and in
synthetic data due to limitations in the visual quality and domain diversity of
physics simulators. Our key finding is that video generation models can
generalize remarkably well when adapted to follow physical force conditioning
from videos synthesized by Blender, even with limited demonstrations of only a
few objects. Our method can generate videos that simulate forces across diverse
geometries, settings, and materials. We also try to understand the source of
this generalization and perform ablations that reveal two key elements: visual
diversity and the use of specific text keywords during training. Our approach
is trained on only around 15k training examples for a single day on four A100
GPUs, and outperforms existing methods on force adherence and physics realism,
bringing world models closer to real-world physics interactions. We release all
datasets, code, weights, and interactive video demos at our project page.
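The abstract does not spell out how a force prompt is actually represented as a conditioning signal. Below is a minimal, purely illustrative sketch, not the paper's encoding, of how the two kinds of force prompts it describes (a global wind force field and a localized point force such as a poke) might be rasterized into dense per-frame maps that a video generation model could condition on. All function names, tensor shapes, and parameters here are hypothetical assumptions for illustration only.

```python
import numpy as np

def make_wind_force_prompt(direction_deg: float, strength: float,
                           num_frames: int, height: int, width: int) -> np.ndarray:
    """Hypothetical encoding of a global wind force: every frame gets a constant
    2-channel field holding the (x, y) components of the wind vector.
    Returned shape: (num_frames, 2, height, width)."""
    theta = np.deg2rad(direction_deg)
    fx, fy = strength * np.cos(theta), strength * np.sin(theta)
    field = np.zeros((num_frames, 2, height, width), dtype=np.float32)
    field[:, 0] = fx
    field[:, 1] = fy
    return field

def make_point_force_prompt(x: int, y: int, direction_deg: float, strength: float,
                            num_frames: int, height: int, width: int,
                            sigma: float = 8.0) -> np.ndarray:
    """Hypothetical encoding of a localized poke: a Gaussian bump centered at
    (x, y) whose 2 channels carry the force direction, placed on the first frame."""
    theta = np.deg2rad(direction_deg)
    ys, xs = np.mgrid[0:height, 0:width]
    bump = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    field = np.zeros((num_frames, 2, height, width), dtype=np.float32)
    field[0, 0] = strength * np.cos(theta) * bump
    field[0, 1] = strength * np.sin(theta) * bump
    return field

# Example: a steady wind blowing to the right across a 49-frame clip.
wind = make_wind_force_prompt(direction_deg=0.0, strength=0.7,
                              num_frames=49, height=64, width=64)
print(wind.shape)  # (49, 2, 64, 64)
```

A dense spatial map like this is just one plausible way to pass a force signal to a video diffusion backbone (e.g., concatenated with the latent of the conditioning image); the paper's actual conditioning interface may differ.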