MolmoMotion: 3D Point Trajectory Prediction with Language Instructions

Abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M, a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench, a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion, a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns under different language instructions and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance that helps generative models synthesize videos with more realistic object motion.
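To make the task statement concrete, the following is a minimal sketch of the inputs and outputs of goal-conditioned 3D point motion forecasting. It is not the MolmoMotion model or its interface: the names `MotionQuery`, `static_forecast`, and `mean_displacement` are hypothetical, the predictor is a trivial placeholder that keeps every point fixed, and the mean Euclidean displacement is a generic illustrative error measure rather than the metric PointMotionBench is stated to use.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class MotionQuery:
    """Hypothetical container for one forecasting query (names are illustrative)."""
    history: list                 # short visual history, e.g. a list of frames (opaque here)
    query_points: List[Point3D]   # 3D query points on the object, in world coordinates
    goal: str                     # language description of the intended goal


def static_forecast(query: MotionQuery, horizon: int) -> List[List[Point3D]]:
    """Placeholder predictor: every query point stays where it is.

    Returns `horizon` future steps, each holding one 3D position per query point.
    A real model would condition on `query.history` and `query.goal`.
    """
    return [list(query.query_points) for _ in range(horizon)]


def mean_displacement(pred: List[List[Point3D]], target: List[List[Point3D]]) -> float:
    """Mean Euclidean distance between predicted and reference positions."""
    distances = [
        dist(p, t)
        for pred_step, target_step in zip(pred, target)
        for p, t in zip(pred_step, target_step)
    ]
    return sum(distances) / len(distances)


# Assumed toy query: two points on an object and a made-up instruction.
query = MotionQuery(
    history=[],
    query_points=[(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)],
    goal="push the cup to the left",
)
pred = static_forecast(query, horizon=5)
```

The output has one trajectory per query point, so any predictor with this signature (autoregressive or generative) can be compared against reference trajectories of the same shape.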