Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals

Abstract

Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well explored, physically meaningful interactions that mimic real-world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts, which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts can enable videos to respond realistically to physical control signals by leveraging the visual and motion priors of the original pretrained model, without using any 3D asset or physics simulator at inference. The primary challenge of force prompting is the difficulty of obtaining high-quality paired force-video training data: in the real world, force signals are hard to measure, and in synthetic data, physics simulators are limited in visual quality and domain diversity. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with demonstrations of only a few objects. Our method can generate videos that simulate forces across diverse geometries, settings, and materials. We also try to understand the source of this generalization and perform ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k training examples for a single day on four A100 GPUs, and outperforms existing methods on force adherence and physics realism, bringing world models closer to real-world physics interactions. We release all datasets, code, weights, and interactive video demos on our project page.
