Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals
May 26, 2025
Authors: Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, Chen Sun
cs.AI
Abstract
Recent advances in video generation models have sparked interest in world
models capable of simulating realistic environments. While navigation has been
well-explored, physically meaningful interactions that mimic real-world forces
remain largely understudied. In this work, we investigate using physical forces
as a control signal for video generation and propose force prompts which enable
users to interact with images through both localized point forces, such as
poking a plant, and global wind force fields, such as wind blowing on fabric.
We demonstrate that these force prompts can enable videos to respond
realistically to physical control signals by leveraging the visual and motion
priors in the original pretrained model, without using any 3D assets or physics
simulators at inference. The primary challenge of force prompting is the
difficulty in obtaining high quality paired force-video training data, both in
the real world due to the difficulty of obtaining force signals, and in
synthetic data due to limitations in the visual quality and domain diversity of
physics simulators. Our key finding is that video generation models can
generalize remarkably well when adapted to follow physical force conditioning
from videos synthesized by Blender, even with limited demonstrations of only a few
objects. Our method can generate videos which simulate forces across diverse
geometries, settings, and materials. To understand the source of this
generalization, we perform ablations that reveal two key elements: visual
diversity and the use of specific text keywords during training. Our approach
is trained on only around 15k training examples for a single day on four A100
GPUs, and outperforms existing methods on force adherence and physics realism,
bringing world models closer to real-world physics interactions. We release all
datasets, code, weights, and interactive video demos at our project page.
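To make the idea of a force prompt concrete, the sketch below encodes the two force types described in the abstract, a localized poke and a global wind field, as dense per-frame control maps. This is a hypothetical illustration only: the function names, tensor shapes, and channel layout are assumptions, not the paper's actual conditioning scheme.

```python
# Hypothetical sketch: encoding the two force prompt types from the abstract
# (a localized poke and a global wind field) as dense per-frame control maps.
# All function names, tensor shapes, and channel layouts are assumptions made
# for illustration; the abstract does not specify the paper's encoding.
import numpy as np

def encode_point_force(x, y, angle_rad, magnitude,
                       num_frames, height, width, sigma=0.05):
    """Localized poke -> (num_frames, 3, H, W) control tensor.

    Channel 0: Gaussian bump at the poke location (x, y normalized to [0, 1]),
               scaled by the force magnitude, applied on the first frame only.
    Channels 1-2: force direction as (cos, sin), broadcast to every pixel.
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    ys, xs = ys / (height - 1), xs / (width - 1)
    bump = magnitude * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    cond = np.zeros((num_frames, 3, height, width), dtype=np.float32)
    cond[0, 0] = bump                  # impulse on the first frame only
    cond[:, 1] = np.cos(angle_rad)     # direction channels on all frames
    cond[:, 2] = np.sin(angle_rad)
    return cond

def encode_wind_force(angle_rad, magnitude, num_frames, height, width):
    """Global wind -> spatially uniform (num_frames, 3, H, W) tensor
    carrying (magnitude, cos, sin) on every frame."""
    cond = np.empty((num_frames, 3, height, width), dtype=np.float32)
    cond[:, 0] = magnitude
    cond[:, 1] = np.cos(angle_rad)
    cond[:, 2] = np.sin(angle_rad)
    return cond

# Example: a rightward poke near the image center, plus a steady leftward wind.
poke = encode_point_force(x=0.5, y=0.5, angle_rad=0.0, magnitude=0.8,
                          num_frames=16, height=64, width=64)
wind = encode_wind_force(angle_rad=np.pi, magnitude=0.4,
                         num_frames=16, height=64, width=64)
print(poke.shape, wind.shape)  # (16, 3, 64, 64) (16, 3, 64, 64)
```

A tensor like this could be concatenated channel-wise with the video model's inputs during fine-tuning, one common way to inject dense control signals into a pretrained video diffusion model; the paper's actual mechanism may differ.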