Learning Long-term Motion Embeddings for Efficient Kinematics Generation

Abstract

Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by operating directly on a long-term motion embedding learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
