DynaFLIP: Rethinking Robot Perception via Tri-Modal Dynamics-Guided Representations

Abstract

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space, where a smaller simplex volume indicates stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains of up to 22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
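To make the simplex-volume idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the simplex has its vertices at the origin and at the three unit-normalized embeddings (image, language, 3D flow), so that its volume is sqrt(det G) / 3!, where G is the 3x3 Gram matrix of the embeddings. The function names `simplex_volume` and `cosine_penalty`, and the specific form of the cosine term, are hypothetical; the contrastive objective mentioned in the abstract is omitted.

```python
import numpy as np

def normalize(x):
    # Project embeddings onto the unit hypersphere.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def simplex_volume(img, txt, flow):
    # ASSUMPTION: simplex = origin plus the three unit embeddings.
    # Its volume is sqrt(det(G)) / 3!, with G the Gram matrix.
    v = np.stack([normalize(img), normalize(txt), normalize(flow)], axis=-2)
    gram = v @ np.swapaxes(v, -1, -2)              # shape (..., 3, 3)
    det = np.clip(np.linalg.det(gram), 0.0, None)  # guard against tiny negatives
    return np.sqrt(det) / 6.0

def cosine_penalty(img, txt, flow):
    # ASSUMPTION: one plausible cosine regularizer, the mean of
    # (1 - cosine similarity) over the three modality pairs.
    i, t, f = normalize(img), normalize(txt), normalize(flow)
    pairs = [(i * t).sum(-1), (i * f).sum(-1), (t * f).sum(-1)]
    return sum(1.0 - c for c in pairs) / 3.0

e1 = np.array([1.0, 0.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0, 0.0])
e3 = np.array([0.0, 0.0, 1.0, 0.0])

# Mutually orthogonal embeddings: largest volume, poorly aligned.
print(simplex_volume(e1, e2, e3), cosine_penalty(e1, e2, e3))

# Identical embeddings: zero volume and zero cosine penalty.
print(simplex_volume(e1, e1, e1), cosine_penalty(e1, e1, e1))

# Coplanar but not aligned: volume is still (numerically) zero,
# while the cosine penalty stays clearly positive.
mid = e1 + e2
print(simplex_volume(e1, e2, mid), cosine_penalty(e1, e2, mid))
```

Under these assumptions, the third case shows one reading of the geometric ambiguity the abstract refers to: the volume alone cannot distinguish truly aligned embeddings from embeddings that merely lie in a common plane, whereas the cosine term can.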