Learning Long-term Motion Embeddings for Efficient Kinematics Generation

Abstract

Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by operating directly on a long-term motion embedding learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.

