Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals

Abstract

Recent advances in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge: text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives, such as elastic collisions and falling dominoes, teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos on our project page.

