DynaFLIP: Rethinking Robot Perception with Tri-Modal Dynamics-Guided Representations

Abstract

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built on visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos and use them as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in a shared hyperspherical space, where a smaller simplex volume indicates stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on the control-relevant regions that are critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this in both simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
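
To make the geometric idea concrete, the following is a minimal sketch of a simplex-volume alignment term for one image-language-flow triplet, together with a cosine regularizer. It is not the DynaFLIP implementation. The abstract does not give the exact formula, so the sketch assumes the volume is computed from the Gram determinant of the three unit-normalized embeddings, with the origin as the fourth simplex vertex. The function names (`simplex_volume`, `cosine_penalty`, `alignment_loss`) and the weight `lam` are hypothetical, and the contrastive objective is omitted for brevity.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pairwise_cosines(img, txt, flow):
    """Cosine similarities between the three unit-normalized embeddings."""
    a, b, c = normalize(img), normalize(txt), normalize(flow)
    return dot(a, b), dot(a, c), dot(b, c)


def simplex_volume(img, txt, flow):
    """Assumed form: volume of the simplex with vertices at the origin and
    the three unit embeddings, i.e. sqrt(det(G)) / 3! for the Gram matrix G."""
    ab, ac, bc = pairwise_cosines(img, txt, flow)
    # Determinant of [[1, ab, ac], [ab, 1, bc], [ac, bc, 1]].
    det = 1.0 + 2.0 * ab * ac * bc - ab**2 - ac**2 - bc**2
    return math.sqrt(max(det, 0.0)) / 6.0


def cosine_penalty(img, txt, flow):
    """One minus the mean pairwise cosine; zero only when all three coincide."""
    ab, ac, bc = pairwise_cosines(img, txt, flow)
    return 1.0 - (ab + ac + bc) / 3.0


def alignment_loss(img, txt, flow, lam=1.0):
    """Volume term plus cosine regularizer (lam is a hypothetical weight)."""
    return simplex_volume(img, txt, flow) + lam * cosine_penalty(img, txt, flow)


# Three nearly identical embeddings: small volume and small cosine penalty.
aligned = ([1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.99, 0.0, 0.1])
# Two identical embeddings and one orthogonal to them: the volume is zero
# even though the third modality is not aligned at all.
degenerate = ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

for name, triplet in (("aligned", aligned), ("degenerate", degenerate)):
    print(name, simplex_volume(*triplet), cosine_penalty(*triplet))
```

The degenerate triplet illustrates one ambiguity of volume minimization on its own: any linearly dependent triplet has zero volume, whether or not the modalities agree. Under the assumptions above, the cosine term separates the two cases, since it is small only when all three embeddings point in nearly the same direction.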