MolmoMotion: Language-Guided 3D Point Trajectory Forecasting

Abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M, a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench, a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion, a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns under different language instructions and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance that helps generative models synthesize videos with more realistic object motion.
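
To make the task definition concrete, the inputs and outputs described above can be sketched as a small data structure. This is a minimal illustration only: the names `ForecastQuery`, `static_forecast`, and `is_valid_forecast`, the frame identifiers, and the example goal string are all hypothetical, and the stand-in predictor simply holds every point still. It is not MolmoMotion or any part of its actual interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[Point3D]  # one future 3D position per time step


@dataclass
class ForecastQuery:
    """Task inputs as described in the abstract (field names are hypothetical)."""
    history_frames: List[str]    # stand-in for the short visual history (e.g. frame ids)
    query_points: List[Point3D]  # 3D query points on the object, in world coordinates
    goal: str                    # language description of the intended goal


def static_forecast(query: ForecastQuery, horizon: int) -> List[Trajectory]:
    """Trivial stand-in predictor: every query point stays where it is.

    NOT the model from the paper; it only illustrates the output shape.
    """
    return [[point] * horizon for point in query.query_points]


def is_valid_forecast(query: ForecastQuery, forecast: List[Trajectory], horizon: int) -> bool:
    """Check for one trajectory per query point, each with `horizon` 3D positions."""
    return (
        len(forecast) == len(query.query_points)
        and all(len(traj) == horizon for traj in forecast)
        and all(len(p) == 3 for traj in forecast for p in traj)
    )


# Hypothetical example: two query points and a four-step forecast horizon.
query = ForecastQuery(
    history_frames=["frame_000", "frame_001"],
    query_points=[(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)],
    goal="push the mug to the left",
)
forecast = static_forecast(query, horizon=4)
```

A learned forecaster would replace `static_forecast` while keeping the same contract: one future 3D trajectory per query point, conditioned on the visual history and the language goal.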