DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

May 28, 2026
Authors: Jusuk Lee, Seungjae Lee, Jonghun Shin, Hoseong Jung, Sungha Kim, Daesol Cho, H. Jin Kim, Jia-Bin Huang, Furong Huang
cs.AI

Abstract

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in a shared hyperspherical space, where a smaller volume indicates stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
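The abstract does not give the exact loss, so the following is only a minimal sketch of one possible reading of the simplex-volume idea, not the authors' implementation. It assumes the simplex has its vertices at the origin and at the three unit-normalized embeddings, so its volume follows from the Gram determinant of the pairwise cosines. The function names (`simplex_volume`, `cosine_penalty`), the averaging in the cosine term, and the example vectors are all hypothetical. The contrastive objective is left out for brevity.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pairwise_cosines(img, txt, flow):
    """Cosine similarities between the three normalized embeddings."""
    a, b, c = normalize(img), normalize(txt), normalize(flow)
    return dot(a, b), dot(a, c), dot(b, c)


def simplex_volume(img, txt, flow):
    """Volume of the simplex with vertices at the origin and the three
    unit embeddings (assumed definition, not taken from the paper)."""
    ab, ac, bc = pairwise_cosines(img, txt, flow)
    # Determinant of the 3x3 Gram matrix with ones on the diagonal.
    gram_det = 1.0 + 2.0 * ab * ac * bc - ab * ab - ac * ac - bc * bc
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(gram_det, 0.0)) / 6.0


def cosine_penalty(img, txt, flow):
    """Mean of (1 - cosine) over the three pairs (hypothetical regularizer)."""
    ab, ac, bc = pairwise_cosines(img, txt, flow)
    return ((1.0 - ab) + (1.0 - ac) + (1.0 - bc)) / 3.0


# Mutually orthogonal embeddings: largest volume, no alignment.
orthogonal = ([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0])
# Identical directions: zero volume, perfect alignment.
aligned = ([1, 2, 3, 4], [2, 4, 6, 8], [3, 6, 9, 12])
# Two antipodal embeddings: zero volume, yet clearly not aligned.
degenerate = ([1, 0, 0, 0], [-1, 0, 0, 0], [0, 1, 0, 0])

for name, triplet in [("orthogonal", orthogonal),
                      ("aligned", aligned),
                      ("degenerate", degenerate)]:
    print(name, simplex_volume(*triplet), cosine_penalty(*triplet))
```

Under these assumptions, the `degenerate` triplet illustrates why volume alone can be ambiguous: any linearly dependent triplet has zero volume, including one with two opposite embeddings, while the cosine term still penalizes it (here with a value of 4/3). This may be the kind of ambiguity the cosine regularizer is meant to resolve.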