

NaviTrace: Evaluating Embodied Navigation of Vision-Language Models

October 30, 2025
Authors: Tim Windecker, Manthan Patel, Moritz Reuss, Richard Schwarzkopf, Cesar Cadena, Rudolf Lioutikov, Marco Hutter, Jonas Frey
cs.AI

Abstract

Vision-language models (VLMs) demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and an embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.
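The abstract names the three components of the semantic-aware trace score but not their exact formulation. The sketch below is a minimal Python illustration of how such a score could be assembled, assuming a length-normalized DTW distance, Euclidean goal endpoint error, a per-embodiment set of traversable semantic labels, and equal combination weights; the function names, the normalization, and the weights are hypothetical stand-ins, not the paper's definition.

```python
import numpy as np

def dtw_distance(trace_a: np.ndarray, trace_b: np.ndarray) -> float:
    """Classic O(n*m) Dynamic Time Warping between two (N, 2) pixel traces."""
    n, m = len(trace_a), len(trace_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(trace_a[i - 1] - trace_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)  # length-normalized (one possible convention)

def semantic_penalty(trace: np.ndarray, semantics: np.ndarray,
                     traversable: set[int]) -> float:
    """Fraction of trace points landing on pixels this embodiment cannot traverse.

    `semantics` is an (H, W) per-pixel label map; `traversable` is the set of
    labels the given embodiment may cross (hypothetical encoding).
    Trace points are (x, y) pixel coordinates.
    """
    h, w = semantics.shape
    cols = np.clip(trace[:, 0].astype(int), 0, w - 1)
    rows = np.clip(trace[:, 1].astype(int), 0, h - 1)
    labels = semantics[rows, cols]
    return float(np.mean([lab not in traversable for lab in labels]))

def trace_score(pred: np.ndarray, expert: np.ndarray, semantics: np.ndarray,
                traversable: set[int],
                w_dtw: float = 1.0, w_goal: float = 1.0, w_sem: float = 1.0) -> float:
    """Lower is better: weighted sum of DTW, goal endpoint error, semantic penalty."""
    goal_err = float(np.linalg.norm(pred[-1] - expert[-1]))
    return (w_dtw * dtw_distance(pred, expert)
            + w_goal * goal_err
            + w_sem * semantic_penalty(pred, semantics, traversable))

# Toy usage: a short predicted trace vs. an expert trace on a 128x128 label map
# where label 0 is traversable everywhere (illustrative values only).
pred = np.array([[10, 120], [40, 100], [80, 90]], dtype=float)
expert = np.array([[12, 118], [45, 95], [85, 88]], dtype=float)
sem = np.zeros((128, 128), dtype=int)
print(trace_score(pred, expert, sem, traversable={0}))
```

In this reading, the embodiment conditioning enters only through the traversable-label set (e.g. stairs penalized for a wheeled robot but not a legged one); the actual weighting and aggregation used on the leaderboard are specified by the paper.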