DynaFLIP: Rethinking Robot Perception through Tri-Modal Dynamics-Grounded Representations

Abstract

Robot manipulation depends critically on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built on visual encoders pre-trained for static recognition or vision-language alignment, which leaves motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in a shared hyperspherical space, where a smaller simplex volume indicates stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on the control-relevant regions that are critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across a range of downstream policies, including vision-language-action models (VLAs). We validate this in both simulation and real-world setups, with gains of up to +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
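To make the geometric idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact formulation, so the sketch assumes the volume is computed as the square root of the Gram determinant of the three unit-normalized embeddings (which is proportional to the volume of the simplex they span with the origin), and that the cosine regularizer is the mean of one minus each pairwise cosine similarity. The function names `gram_volume` and `cosine_penalty` are hypothetical, and the contrastive term is omitted.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pairwise_cosines(img, txt, flow):
    """Cosine similarities between the three unit-normalized embeddings."""
    a, b, c = normalize(img), normalize(txt), normalize(flow)
    return dot(a, b), dot(a, c), dot(b, c)


def gram_volume(img, txt, flow):
    """sqrt(det G), where G is the 3x3 Gram matrix of the unit vectors.

    Assumption: this stands in for the simplex volume; the simplex spanned
    with the origin has volume sqrt(det G) / 6.
    """
    ab, ac, bc = pairwise_cosines(img, txt, flow)
    # Closed-form determinant of a 3x3 Gram matrix with unit diagonal.
    det = 1.0 + 2.0 * ab * ac * bc - ab * ab - ac * ac - bc * bc
    return math.sqrt(max(det, 0.0))  # clamp tiny negative round-off


def cosine_penalty(img, txt, flow):
    """Assumed regularizer: mean of (1 - cosine) over the three pairs."""
    ab, ac, bc = pairwise_cosines(img, txt, flow)
    return ((1.0 - ab) + (1.0 - ac) + (1.0 - bc)) / 3.0


if __name__ == "__main__":
    aligned = ([2.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0])
    opposed = ([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
    for name, triplet in (("aligned", aligned), ("opposed", opposed)):
        print(name, gram_volume(*triplet), cosine_penalty(*triplet))
```

Under these assumptions, the `opposed` triplet illustrates the ambiguity the abstract mentions: its volume is zero because two of the vectors are collinear, even though they point in opposite directions, so volume alone cannot distinguish it from the `aligned` triplet. The cosine term separates the two cases.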