MolmoMotion: Predicting 3D Point Trajectories from Language Instructions

Abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly usable in downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack for studying this task at scale: (1) MolmoMotion-1M, a large corpus of object-grounded 3D point trajectories with action descriptions, annotated from 1.16M unconstrained videos; (2) PointMotionBench, a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion, a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns in response to different language instructions and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories give generative models effective motion guidance for synthesizing videos with more realistic object motion.
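
The task formulation above can be made concrete with a small sketch of its inputs and outputs. Everything here is illustrative and assumed rather than taken from MolmoMotion: the names `ForecastQuery`, `constant_position_forecast`, and `average_displacement_error` are hypothetical, the predictor is a trivial placeholder that ignores the frames and the goal, and average displacement error is a common trajectory-forecasting metric used as a stand-in, since the abstract does not say which metrics PointMotionBench reports.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple

Point3D = Tuple[float, float, float]

# A forecast holds one future trajectory per query point:
# N trajectories, each a list of T future 3D positions.
Forecast = List[List[Point3D]]


@dataclass
class ForecastQuery:
    """Hypothetical container for the three task inputs named in the abstract."""
    history_frames: list          # short visual history (frame type left unspecified)
    query_points: List[Point3D]   # N 3D points on the object, in world coordinates
    goal: str                     # language description of the intended goal


def constant_position_forecast(query: ForecastQuery, horizon: int) -> Forecast:
    """Placeholder predictor (an assumption, not the paper's model):
    every query point stays at its current position for `horizon` steps."""
    return [[p] * horizon for p in query.query_points]


def average_displacement_error(pred: Forecast, target: Forecast) -> float:
    """Mean Euclidean distance between predicted and target positions,
    averaged over all points and all future time steps."""
    dists = [
        dist(p, t)
        for pred_traj, target_traj in zip(pred, target)
        for p, t in zip(pred_traj, target_traj)
    ]
    return sum(dists) / len(dists)


query = ForecastQuery(
    history_frames=[],
    query_points=[(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)],
    goal="push the cup to the right",
)
pred = constant_position_forecast(query, horizon=3)
```

The output has one trajectory per query point, so a model's forecast can be compared point by point against ground-truth trajectories with a function such as `average_displacement_error`.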