Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals
May 26, 2025
Authors: Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, Chen Sun
cs.AI
Abstract
Recent advances in video generation models have sparked interest in world
models capable of simulating realistic environments. While navigation has been
well-explored, physically meaningful interactions that mimic real-world forces
remain largely understudied. In this work, we investigate using physical forces
as a control signal for video generation and propose force prompts which enable
users to interact with images through both localized point forces, such as
poking a plant, and global wind force fields, such as wind blowing on fabric.
We demonstrate that these force prompts can enable videos to respond
realistically to physical control signals by leveraging the visual and motion
priors in the original pretrained model, without using any 3D assets or physics
simulators at inference. The primary challenge of force prompting is the
difficulty in obtaining high quality paired force-video training data, both in
the real world due to the difficulty of obtaining force signals, and in
synthetic data due to limitations in the visual quality and domain diversity of
physics simulators. Our key finding is that video generation models can
generalize remarkably well when adapted to follow physical force conditioning
from videos synthesized by Blender, even with limited demonstrations of only a few
objects. Our method can generate videos which simulate forces across diverse
geometries, settings, and materials. To understand the source of this
generalization, we perform ablations that reveal two key elements: visual
diversity and the use of specific text keywords during training. Our approach
is trained on only around 15k training examples for a single day on four A100
GPUs, and outperforms existing methods on force adherence and physics realism,
bringing world models closer to real-world physics interactions. We release all
datasets, code, weights, and interactive video demos at our project page.
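To make the idea of a force prompt concrete, the sketch below encodes the two force types described in the abstract, a localized poke and a global wind field, as dense per-frame control maps. This is a hypothetical illustration only: the function names, tensor shapes, and channel layout are assumptions, not the paper's actual conditioning scheme.

```python
# Hypothetical sketch: encoding the two force prompt types from the abstract
# (a localized poke and a global wind field) as dense per-frame control maps.
# All function names, tensor shapes, and channel layouts are assumptions made
# for illustration; the abstract does not specify the paper's encoding.
import numpy as np

def encode_point_force(x, y, angle_rad, magnitude,
                       num_frames, height, width, sigma=0.05):
    """Localized poke -> (num_frames, 3, H, W) control tensor.

    Channel 0: Gaussian bump at the poke location (x, y normalized to [0, 1]),
               scaled by the force magnitude, applied on the first frame only.
    Channels 1-2: force direction as (cos, sin), broadcast to every pixel.
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    ys, xs = ys / (height - 1), xs / (width - 1)
    bump = magnitude * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    cond = np.zeros((num_frames, 3, height, width), dtype=np.float32)
    cond[0, 0] = bump                  # impulse on the first frame only
    cond[:, 1] = np.cos(angle_rad)     # direction channels on all frames
    cond[:, 2] = np.sin(angle_rad)
    return cond

def encode_wind_force(angle_rad, magnitude, num_frames, height, width):
    """Global wind -> spatially uniform (num_frames, 3, H, W) tensor
    carrying (magnitude, cos, sin) on every frame."""
    cond = np.empty((num_frames, 3, height, width), dtype=np.float32)
    cond[:, 0] = magnitude
    cond[:, 1] = np.cos(angle_rad)
    cond[:, 2] = np.sin(angle_rad)
    return cond

# Example: a rightward poke near the image center, plus a steady leftward wind.
poke = encode_point_force(x=0.5, y=0.5, angle_rad=0.0, magnitude=0.8,
                          num_frames=16, height=64, width=64)
wind = encode_wind_force(angle_rad=np.pi, magnitude=0.4,
                         num_frames=16, height=64, width=64)
print(poke.shape, wind.shape)  # (16, 3, 64, 64) (16, 3, 64, 64)
```

A tensor like this could be concatenated channel-wise with the video model's inputs during fine-tuning, one common way to inject dense control signals into a pretrained video diffusion model; the paper's actual mechanism may differ.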