MolmoMotion: Predicting 3D Point Trajectories via Language Instructions

Abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M, a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench, a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion, a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns conditioned on different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models, which then synthesize videos with more realistic object motion.
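
To make the task's inputs and outputs concrete, the following is a minimal Python sketch of the interface described above. It is not MolmoMotion's actual API. The names `ForecastQuery`, `static_baseline`, and `average_displacement_error` are hypothetical and introduced only for illustration. The static predictor is a trivial placeholder, and the displacement metric is a common choice in trajectory forecasting that is assumed here, not necessarily the one used by PointMotionBench.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple

Point3D = Tuple[float, float, float]   # (x, y, z) in world coordinates
Trajectory = List[Point3D]             # one future position per time step


@dataclass
class ForecastQuery:
    """Hypothetical container for one task instance (illustrative names)."""
    history_frames: List[object]   # short visual history, kept opaque here
    query_points: List[Point3D]    # 3D query points on the object of interest
    instruction: str               # language description of the intended goal


def static_baseline(query: ForecastQuery, horizon: int) -> List[Trajectory]:
    """Placeholder predictor: every query point stays at its current position.

    A real model would condition on the frames and the instruction;
    this stub only demonstrates the expected output shape.
    """
    return [[point] * horizon for point in query.query_points]


def average_displacement_error(pred: List[Trajectory],
                               target: List[Trajectory]) -> float:
    """Mean Euclidean distance over all points and all future time steps."""
    distances = [
        dist(p, t)
        for pred_traj, target_traj in zip(pred, target)
        for p, t in zip(pred_traj, target_traj)
    ]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    query = ForecastQuery(
        history_frames=[],
        query_points=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        instruction="push the cup to the right",
    )
    prediction = static_baseline(query, horizon=3)
    # Made-up ground truth: both points drift along +x by 0.1 per step.
    ground_truth = [
        [(0.1, 0.0, 0.0), (0.2, 0.0, 0.0), (0.3, 0.0, 0.0)],
        [(1.1, 0.0, 0.0), (1.2, 0.0, 0.0), (1.3, 0.0, 0.0)],
    ]
    print(average_displacement_error(prediction, ground_truth))
```

The output is one trajectory per query point, each with `horizon` future positions, which matches the per-point prediction in the task definition. The ground-truth values in the usage example are made up for illustration.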