PlatonicNav: Uncovering Semantic Correspondences in Navigation with Platonic Topological Maps

Abstract

Embodied visual navigation, in which an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications, including household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) have relied on architectural fusion, mixed-task training, and large-scale vision-language pretraining, without examining whether independently trained vision and language encoders may already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible from a map built purely from vision. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with real-world deployment on a Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.