PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps

Abstract

Embodied visual navigation, where an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large-scale vision-language pretraining, without examining whether independently trained vision and language encoders already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open the question of whether such grounding is possible from a map built purely from vision. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder, and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with deployment on a Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.