
NaviTrace: Evaluating Embodied Navigation of Vision-Language Models

October 30, 2025
作者: Tim Windecker, Manthan Patel, Moritz Reuss, Richard Schwarzkopf, Cesar Cadena, Rudolf Lioutikov, Marco Hutter, Jonas Frey
cs.AI

Abstract

Vision-language models (VLMs) demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark in which a model receives an instruction and an embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and it correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.
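To make the first two ingredients of the trace score concrete, here is a minimal sketch of Dynamic Time Warping distance and goal endpoint error between two 2D traces in image space. This is an illustrative implementation of the standard DTW recurrence, not NaviTrace's actual scoring code, and it omits the embodiment-conditioned per-pixel semantic penalties described in the abstract; the function names are hypothetical.

```python
import math

def dtw_distance(trace_a, trace_b):
    """Dynamic Time Warping distance between two 2D traces,
    each a list of (x, y) pixel coordinates."""
    n, m = len(trace_a), len(trace_b)
    # dp[i][j] = minimal cumulative cost aligning trace_a[:i] with trace_b[:j]
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(trace_a[i - 1], trace_b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # step in trace_a only
                                  dp[i][j - 1],      # step in trace_b only
                                  dp[i - 1][j - 1])  # step in both
    return dp[n][m]

def goal_endpoint_error(trace_a, trace_b):
    """Euclidean distance between the final points of two traces."""
    return math.dist(trace_a[-1], trace_b[-1])
```

Because DTW aligns points non-linearly along each trace, it tolerates traces of different lengths and sampling densities, which is why it is a natural choice for comparing a predicted trace against expert traces.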