MolmoMotion: Language-Instructed 3D Point Trajectory Forecasting

Abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.
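
To make the task formulation concrete, the following is a minimal sketch of the inputs and outputs of goal-conditioned 3D point motion forecasting. It illustrates array shapes only and is not the MolmoMotion model or its API: the names `ForecastQuery` and `stationary_forecast`, the field layout, and the example goal string are all hypothetical, and the predictor is a trivial placeholder that keeps every point fixed.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ForecastQuery:
    """One task instance. Field names and layouts are assumptions for illustration."""
    history: np.ndarray       # (T_hist, H, W, 3) frames of the short visual history
    query_points: np.ndarray  # (N, 3) 3D query points in world coordinates
    goal: str                 # language description of the intended goal


def stationary_forecast(query: ForecastQuery, horizon: int) -> np.ndarray:
    """Placeholder predictor (not the paper's method): every point stays put.

    Returns an array of shape (N, horizon, 3), i.e. one future 3D
    trajectory per query point.
    """
    if query.query_points.ndim != 2 or query.query_points.shape[1] != 3:
        raise ValueError("query_points must have shape (N, 3)")
    # Insert a time axis, then repeat each point along it.
    return np.repeat(query.query_points[:, None, :], horizon, axis=1)


# Hypothetical example: two query points on an object, five future steps.
query = ForecastQuery(
    history=np.zeros((4, 8, 8, 3), dtype=np.uint8),
    query_points=np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0]]),
    goal="push the mug to the left",
)
trajectories = stationary_forecast(query, horizon=5)
```

A learned forecaster would replace `stationary_forecast` while keeping the same contract: it conditions on the history, the query points, and the goal text, and returns one 3D trajectory per query point.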