MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.
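To make the input and output of the forecasting task concrete, the following is a minimal sketch of its interface, not the MolmoMotion model. Several things here are assumptions for illustration only: the `ForecastQuery` container and its field names are hypothetical, the visual history is replaced by already-observed 3D positions of the query points, the predictor is a naive constant-velocity extrapolation that ignores the language goal, and average displacement error is used as an example metric without any claim that PointMotionBench scores predictions this way.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[Point3D]  # one point's world-space positions over time


@dataclass
class ForecastQuery:
    # Hypothetical container; field names are illustrative, not from the paper.
    past_tracks: List[Trajectory]  # per query point: observed positions (stand-in for the visual history)
    goal: str                      # language description of the intended goal
    horizon: int                   # number of future steps to predict


def constant_velocity_forecast(query: ForecastQuery) -> List[Trajectory]:
    """Naive stand-in predictor: extrapolate each point's last observed velocity.

    It ignores query.goal entirely, which a goal-conditioned model would not.
    """
    futures = []
    for track in query.past_tracks:
        last = track[-1]
        prev = track[-2] if len(track) > 1 else last
        velocity = tuple(a - b for a, b in zip(last, prev))
        futures.append([
            tuple(p + v * step for p, v in zip(last, velocity))
            for step in range(1, query.horizon + 1)
        ])
    return futures


def average_displacement_error(pred: List[Trajectory], target: List[Trajectory]) -> float:
    """Mean Euclidean distance over all points and all future time steps."""
    distances = [
        dist(p, t)
        for pred_track, target_track in zip(pred, target)
        for p, t in zip(pred_track, target_track)
    ]
    return sum(distances) / len(distances)
```

A learned forecaster would keep the same output shape (one future 3D trajectory per query point) but condition on the video frames and the goal text, so that different instructions for the same scene yield different trajectories.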