PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps
June 1, 2026
Authors: Junlin Long, Zeyu Zhang, Xu Deng, Yiran Wang, Yue Yang, Luke Borgnolo, Maxwell Twelftree, Yang Zhao
cs.AI
Abstract
Embodied visual navigation, where an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large-scale vision-language pretraining, without examining whether independently trained vision and language encoders already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible on a map built purely from vision. To address these questions, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder, and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with real-world deployment on a Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.
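The abstract does not give the fusion formula or the matching algorithm, so the following is only an illustrative sketch of the two ideas it names, not the authors' implementation. It assumes that node distances are fused as a convex combination of normalized Euclidean distance and cosine distance, and it stands in for blind matching with a brute-force search over permutations that compares the pairwise distance structure of the two embedding spaces. The function names, the `alpha` weight, and the brute-force search are all hypothetical choices made for this example.

```python
import itertools

import numpy as np


def cosine_distance_matrix(feats):
    """Pairwise cosine distances between the (nonzero) rows of `feats`."""
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def fused_node_distance(positions, feats, alpha=0.5):
    """Hypothetical fusion: alpha * geometric + (1 - alpha) * semantic distance.

    positions: (n, 3) node coordinates; feats: (n, d) visual features.
    """
    diff = positions[:, None, :] - positions[None, :, :]
    geo = np.linalg.norm(diff, axis=-1)
    geo = geo / max(geo.max(), 1e-12)  # scale geometric distances to [0, 1]
    sem = cosine_distance_matrix(feats)
    return alpha * geo + (1.0 - alpha) * sem


def blind_match(vision_feats, text_feats):
    """Match visual nodes to text labels using only within-modality structure.

    Returns a tuple `perm` where perm[i] is the index of the text embedding
    assigned to visual node i. No vision-text pairs are used: the search only
    compares the two pairwise distance matrices. Brute force is an assumption
    made for clarity and is feasible only for a handful of nodes.
    """
    dv = cosine_distance_matrix(vision_feats)
    dt = cosine_distance_matrix(text_feats)
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(len(dv))):
        p = list(perm)
        # Compare dv[i, j] against dt[perm[i], perm[j]] for all node pairs.
        cost = np.abs(dv - dt[np.ix_(p, p)]).sum()
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm
```

The matching step can only succeed when the two encoders induce similar relative geometry over the same set of objects, which is the premise the Platonic Representation Hypothesis supplies. The number of permutations grows factorially with the number of nodes, so any practical system would need a more scalable solver than this exhaustive search.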