PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps

June 1, 2026
Authors: Junlin Long, Zeyu Zhang, Xu Deng, Yiran Wang, Yue Yang, Luke Borgnolo, Maxwell Twelftree, Yang Zhao
cs.AI

Abstract

Embodied visual navigation, where an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large-scale vision-language pretraining, without examining whether independently trained vision and language encoders may already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible from a purely vision-built map. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder, and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with deployment on Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.
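
The abstract states that the Platonic Topological Map fuses geometric and semantic node distances but does not give the fusion rule. As a purely illustrative sketch, one simple way to combine the two is a convex blend of normalized Euclidean distance and cosine distance between node embeddings. The function name `fused_node_distance`, the weight `alpha`, the normalization, and the array layouts are all assumptions made here for illustration; they are not taken from the paper or its repository.

```python
import numpy as np


def fused_node_distance(positions, features, alpha=0.5):
    """Blend geometric and semantic distances between map nodes.

    Hypothetical sketch, not the PlatonicNav implementation.
    positions: (N, 3) node coordinates (assumed layout).
    features:  (N, D) per-node embeddings from a visual encoder.
    alpha:     assumed weight on the geometric term, in [0, 1].
    """
    positions = np.asarray(positions, dtype=float)
    features = np.asarray(features, dtype=float)

    # Pairwise Euclidean distance, scaled to [0, 1] by the largest value.
    diff = positions[:, None, :] - positions[None, :, :]
    geo = np.linalg.norm(diff, axis=-1)
    if geo.max() > 0:
        geo = geo / geo.max()

    # Pairwise cosine distance, mapped from [0, 2] to [0, 1].
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sem = np.clip((1.0 - unit @ unit.T) / 2.0, 0.0, 1.0)

    # Convex combination: alpha=1 is purely geometric, alpha=0 purely semantic.
    return alpha * geo + (1.0 - alpha) * sem


if __name__ == "__main__":
    # Three toy nodes: nodes 0 and 1 are close and look alike, node 2 differs.
    positions = [[0, 0, 0], [1, 0, 0], [0, 2, 0]]
    features = [[1, 0], [1, 0], [0, 1]]
    print(fused_node_distance(positions, features, alpha=0.5))
```

In this toy setup, the pair of nodes that is both nearby and visually similar ends up with a smaller fused distance than the pair that is farther apart and dissimilar, which is the behaviour one would want from edge weights in an object-centric topological map.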