PlatonicNav: Uncovering Semantic Correspondence in Navigation via Platonic Topological Maps

Abstract

Embodied visual navigation, in which an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large-scale vision-language pretraining, without examining whether independently trained vision and language encoders may already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible from a map built purely from vision. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder and grounds language goals via blind matching, without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with real-world deployment on a Unitree Go2 robot, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.
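The two mechanisms named in the abstract, fusing geometric and semantic node distances and grounding language goals by blind matching, can be illustrated with a minimal sketch. The abstract does not specify either rule, so everything here is an assumption made for illustration: the convex combination weighted by `alpha`, the use of cosine distance, the brute-force permutation search, and the function names `cosine_kernel`, `fused_node_distance`, and `blind_match` are hypothetical and are not taken from the PlatonicNav code base.

```python
import itertools
import math


def cosine_kernel(feats):
    """Pairwise cosine similarity between feature vectors (lists of floats)."""
    norms = [math.sqrt(sum(x * x for x in f)) for f in feats]
    return [
        [sum(a * b for a, b in zip(fi, fj)) / (ni * nj)
         for fj, nj in zip(feats, norms)]
        for fi, ni in zip(feats, norms)
    ]


def fused_node_distance(positions, feats, alpha=0.5):
    """Assumed fusion rule: alpha * geometric + (1 - alpha) * semantic.

    Geometric distance is Euclidean, rescaled to [0, 1] by the largest
    pairwise distance; semantic distance is 1 - cosine similarity.
    """
    n = len(positions)
    geo = [[math.dist(positions[i], positions[j]) for j in range(n)]
           for i in range(n)]
    scale = max(max(row) for row in geo) or 1.0  # avoid dividing by zero
    sim = cosine_kernel(feats)
    return [
        [alpha * geo[i][j] / scale
         + (1.0 - alpha) * max(0.0, 1.0 - sim[i][j])
         for j in range(n)]
        for i in range(n)
    ]


def blind_match(visual_feats, text_feats):
    """Assumed blind matching: align two embedding sets without paired data.

    Only the within-modality similarity structure is compared. The search
    tries every permutation, so it is only feasible for a handful of items.
    Returns a tuple where entry i is the visual index matched to text item i.
    """
    k_vis = cosine_kernel(visual_feats)
    k_txt = cosine_kernel(text_feats)
    n = len(k_vis)

    def mismatch(perm):
        # Squared difference between the text kernel and the permuted
        # visual kernel.
        return sum((k_txt[i][j] - k_vis[perm[i]][perm[j]]) ** 2
                   for i in range(n) for j in range(n))

    return min(itertools.permutations(range(n)), key=mismatch)
```

The point of the sketch is that `blind_match` never compares a visual vector with a text vector directly, so the two encoders may have different dimensions and need no shared training. A practical system would replace the exhaustive search with an approximate assignment solver.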