DynaFLIP: Rethinking Robotic Perception via Tri-Modal Dynamics-Guided Representation

Abstract

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built on visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space, where a smaller simplex volume indicates stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate these findings in both simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
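To make the combined objective concrete, the following is a minimal NumPy sketch of one possible reading of it, not the authors' implementation. It assumes the "simplex" is the triangle whose vertices are the three L2-normalized embeddings (image, language, 3D flow), that the cosine regularizer is the mean of one minus the pairwise cosine similarities, and that the contrastive term is a standard InfoNCE loss from image to each of the other two modalities. The function names, the loss weights `w_vol`, `w_cos`, `w_con`, and the temperature are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project embeddings onto the unit hypersphere (row-wise).
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def simplex_volume(img, txt, flow):
    # Area of the triangle with vertices img, txt, flow, per sample,
    # computed from the Gram determinant of the two edge vectors.
    u = txt - img
    v = flow - img
    uu = np.sum(u * u, axis=-1)
    vv = np.sum(v * v, axis=-1)
    uv = np.sum(u * v, axis=-1)
    gram_det = np.clip(uu * vv - uv ** 2, 0.0, None)  # guard against round-off
    return 0.5 * np.sqrt(gram_det)

def cosine_regularizer(img, txt, flow):
    # Assumed form: 1 minus the mean pairwise cosine similarity (unit inputs).
    pair_sum = (np.sum(img * txt, axis=-1)
                + np.sum(img * flow, axis=-1)
                + np.sum(txt * flow, axis=-1))
    return 1.0 - pair_sum / 3.0

def info_nce(anchor, positive, temperature=0.07):
    # Standard InfoNCE: matching rows are positives, other rows are negatives.
    logits = anchor @ positive.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def combined_loss(img, txt, flow, w_vol=1.0, w_cos=1.0, w_con=1.0):
    # Hypothetical weighting of the three terms named in the abstract.
    img, txt, flow = (l2_normalize(x) for x in (img, txt, flow))
    vol = simplex_volume(img, txt, flow).mean()
    cos = cosine_regularizer(img, txt, flow).mean()
    con = 0.5 * (info_nce(img, txt) + info_nce(img, flow))
    return w_vol * vol + w_cos * cos + w_con * con
```

Under this reading, the volume term alone is already zero when only two of the three embeddings coincide (for example, image and language identical but flow orthogonal), while the cosine term is still 2/3 in that case. This is one way to see why a volume term might be paired with additional objectives, though the abstract does not specify the exact ambiguity it refers to.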