Learning Long-term Motion Embeddings for Efficient Kinematics Generation
April 13, 2026
Authors: Nick Stracke, Kolja Bauer, Stefan Andreas Baumann, Miguel Angel Bautista, Josh Susskind, Björn Ommer
cs.AI
Abstract
Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectory data obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
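The first stage, the 64x temporally compressed motion embedding, is easiest to picture as a sequence autoencoder over point tracks. Below is a minimal sketch, assuming each track is a per-frame (x, y, visibility) triplet; the module name, channel widths, and the three stride-4 Conv1d stages (4 x 4 x 4 = 64) are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn


class TrajectoryAutoencoder(nn.Module):
    """Hypothetical sketch: compress (x, y, visibility) point tracks
    along time by a factor of 64 via three strided Conv1d stages."""

    def __init__(self, in_dim: int = 3, latent_dim: int = 64, hidden: int = 256):
        super().__init__()
        # Each stage halves-twice (stride 4); kernel/padding chosen so that
        # lengths divisible by 64 are compressed exactly to T // 64.
        self.encoder = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.Conv1d(hidden, latent_dim, kernel_size=8, stride=4, padding=2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, hidden, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.ConvTranspose1d(hidden, hidden, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.ConvTranspose1d(hidden, in_dim, kernel_size=8, stride=4, padding=2),
        )

    def forward(self, tracks: torch.Tensor) -> torch.Tensor:
        # tracks: (batch, channels, T) -> latent: (batch, latent_dim, T // 64)
        z = self.encoder(tracks)
        return self.decoder(z)


if __name__ == "__main__":
    model = TrajectoryAutoencoder()
    tracks = torch.randn(16, 3, 256)   # 16 tracks, 256 frames of (x, y, vis)
    z = model.encoder(tracks)          # (16, 64, 4): 64x shorter in time
    recon = model.decoder(z)           # (16, 3, 256): reconstructed tracks
    loss = nn.functional.mse_loss(recon, tracks)
```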
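The second stage trains a generator directly in that latent space. The sketch below shows one generic conditional flow-matching (rectified-flow) training step; `velocity_net`, the conditioning interface, and the linear noise-to-data path are standard choices assumed here for illustration, not details confirmed by the abstract.

```python
import torch


def flow_matching_loss(velocity_net, latents, cond):
    """One conditional flow-matching training step (rectified-flow form).

    velocity_net, latents, and cond are placeholders, not the paper's API.
    latents: clean motion latents from the frozen encoder, (B, C, T // 64).
    cond: task embedding, e.g. an encoded text prompt or spatial poke.
    """
    noise = torch.randn_like(latents)                        # x_0 ~ N(0, I)
    t = torch.rand(latents.shape[0], 1, 1, device=latents.device)
    x_t = (1.0 - t) * noise + t * latents                    # linear interpolation path
    target_v = latents - noise                               # constant target velocity
    pred_v = velocity_net(x_t, t.flatten(), cond)            # predict the velocity field
    return torch.nn.functional.mse_loss(pred_v, target_v)


# Toy usage with a stand-in network (a real model would be e.g. a transformer).
net = lambda x, t, c: x * 0.0          # placeholder velocity predictor
z = torch.randn(8, 64, 4)              # motion latents from the frozen encoder
text_emb = torch.randn(8, 512)         # stand-in text-prompt embedding
loss = flow_matching_loss(net, z, text_emb)
```

At inference one would integrate the learned velocity field from noise to a motion latent (e.g. with a few Euler steps) and decode it with the frozen decoder; operating on sequences 64x shorter than the frame count is presumably what makes sampling many candidate futures far cheaper than full video synthesis.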